# **Analyzing Transcripts in Resident Evil: The Prepwork**
## **About this Project**
In this project, we'll be analyzing the historical characterization of male and female characters in the *Resident Evil* series by processing fan-made transcripts from [GameFAQs](https://gamefaqs.gamespot.com/) and the [Game Scripts Wiki](https://game-scripts-wiki.blogspot.com/p/resident-evil-series.html) and performing numerical analyses on these transcripts.

### How many games does this project entail?
With ten mainline releases and three remakes (as of 2024), we'll be looking at transcripts for **thirteen** *Resident Evil* games. In the interest of transparency, here are all of the games we are including in our analysis:
* [*Resident Evil 0*](https://en.wikipedia.org/wiki/Resident_Evil_Zero) (2002)
* [*Resident Evil 1*](https://en.wikipedia.org/wiki/Resident_Evil_(1996_video_game)) (1996)
* [*Resident Evil 2*](https://en.wikipedia.org/wiki/Resident_Evil_2) (1998)
* [*Resident Evil 2* Remake](https://en.wikipedia.org/wiki/Resident_Evil_2_(2019_video_game)) (2019)
* [*Resident Evil 3: Nemesis*](https://en.wikipedia.org/wiki/Resident_Evil_3:_Nemesis) (1999)
* [*Resident Evil 3* Remake](https://en.wikipedia.org/wiki/Resident_Evil_3_(2020_video_game)) (2020)
* [*Resident Evil - Code: Veronica*](https://en.wikipedia.org/wiki/Resident_Evil_%E2%80%93_Code:_Veronica) (2000)
* [*Resident Evil 4*](https://en.wikipedia.org/wiki/Resident_Evil_4) (2005)
* [*Resident Evil 4* Remake](https://en.wikipedia.org/wiki/Resident_Evil_4_(2023_video_game)) (2023)
* [*Resident Evil 5*](https://en.wikipedia.org/wiki/Resident_Evil_5) (2009)
* [*Resident Evil 6*](https://en.wikipedia.org/wiki/Resident_Evil_6) (2012)
* [*Resident Evil 7*](https://en.wikipedia.org/wiki/Resident_Evil_7:_Biohazard) (2017)
* [*Resident Evil Village*](https://en.wikipedia.org/wiki/Resident_Evil_Village) (2021)

### Why focus on *Resident Evil*?
We chose to focus on the *Resident Evil* games because of their theme and their history; the *Resident Evil* franchise has been around since 1996 and focuses on survival horror, a genre that typically explores unequal power dynamics (monster versus humans and human versus human). We thought that *Resident Evil* transcripts would provide some interesting historical data on how male and female characters have been portrayed in games and whether the notions of games being a male-dominated space have actually changed in recent decades.

### But why focus on male and female characters? Isn't gender more complex?
While we acknowledge that gender is not strictly binary, the *Resident Evil* series doesn't prominently feature characters who identify outside of the male-female binary. Most of the characters are identified with he / him or she / her pronouns (and in some cases, it / its pronouns if the character's gender identity is unspecified). To the best of our knowledge, none of the characters in the *Resident Evil* games are explicitly identified as non-binary, agender, or genderfluid characters. Since we are trying to limit our focus to historical analysis, we are choosing to only use canon gender identities, which results in us adhering to that strict gender binary.

### You mentioned using fan-transcribed dialogues for the analysis. Why not extract the data directly from the game yourselves?
The simple answer is that our team lacks the resources to perform direct textual extraction from the games. No one in our team owns the entire *Resident Evil* series; in fact, out of the whole team, only two of us have played a *Resident Evil* game. We are thus relying on fan-transcribed dialogues to gather the data for our analysis.

### If you haven't played a *Resident Evil* game, why are you choosing to look at *Resident Evil*?
We feel that familiarity with a game shouldn't be a limiting factor when conducting an analysis. While personal passion can enhance an analysis, the reality is that we are sometimes called to do projects that lie outside our immediate interests and expertise; I, for instance, would rather explore topics like game taxonomy than character dialogues. But we believe that gender representation in games is an important issue that warrants exploration -- and *Resident Evil* transcripts give us a means of exploring that subject matter.

## **Step 1: Extracting the Data**
As we mentioned earlier in this notebook, we're compiling our initial dataset from [GameFAQs](https://gamefaqs.gamespot.com/) and the [Game Scripts Wiki](https://game-scripts-wiki.blogspot.com/p/resident-evil-series.html). We accomplished this part of the project by copying and pasting data from webpages into .txt files.

You might be wondering why we didn't automate this process. When we attempted to scrape data from the websites, some of our scraping attempts were blocked for unidentifiable reasons. Out of concern, we checked the `robots.txt` files but found that data extraction was permitted; we were not violating any website usage agreements. In the interest of time, we decided it would be best for us to manually collect the data and clean it later in the project, rather than try to come up with a working automated solution.
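For readers who want to repeat the `robots.txt` check in code, Python's standard `urllib.robotparser` module can evaluate a set of rules. The sketch below parses a made-up rule set inline instead of fetching a live file; the rules, the `is_allowed` helper, and the `example.com` URLs are assumptions for illustration and do not reproduce either site's actual `robots.txt`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; a real check would load the
# site's own robots.txt with set_url(...) followed by read().
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(rules: str, url: str, user_agent: str = '*') -> bool:
    """Returns True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_RULES, 'https://example.com/scripts/re0.html'))
print(is_allowed(SAMPLE_RULES, 'https://example.com/private/notes.html'))
```

A permissive `robots.txt` only describes what crawlers are asked to avoid, so a site can still block automated requests for other reasons.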

Assume that the transcripts have been copied and pasted into .txt files stored within the **current working directory**, which is the directory that contains this Jupyter notebook.
```
is-278-resident-evil-data-viz/
├── re-all-data/  <-- This is the working directory
│   ├── RE_Data_Setup.ipynb
|   └── RE0_Transcript.txt
```
To open and read the contents of our .txt files, we need to know where the files are stored. This information can be retrieved with the [`os`](https://docs.python.org/3/library/os.html) module.

In [33]:
import os

def get_path_to_file(file: str) -> str:
    """Returns the path to a file in the working directory.
    Note that there is no error checking, so the function will return a path
    even if the file doesn't exist within the working directory."""
    return os.path.realpath(file)

# This should print out the path to the .ipynb file
# get_path_to_file('RE_Data_Setup.ipynb')
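Because `get_path_to_file` returns a path whether or not the file exists, a stricter variant can be useful when a missing transcript should stop the notebook early. The following is a minimal sketch; `get_path_to_existing_file` and the file name `Missing_Transcript.txt` are hypothetical and are not used elsewhere in this notebook.

```python
import os

def get_path_to_existing_file(file: str) -> str:
    """Returns the path to a file, or raises FileNotFoundError if it is missing.
    This is a hypothetical variant of get_path_to_file.
    """
    path = os.path.realpath(file)
    if not os.path.isfile(path):
        raise FileNotFoundError(f'No such file: {path}')
    return path

# Assumes no file with this name exists in the working directory
try:
    get_path_to_existing_file('Missing_Transcript.txt')
except FileNotFoundError as error:
    print(error)
```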

### Retrieving the Lines of Dialogue from the Transcripts
We know from skimming the transcripts that the dialogue follows the format `SPEAKER: LINE`, where `SPEAKER` is the name of a character (e.g. "Barry") and `LINE` is that character's line of dialogue (e.g. "That was too close. You were almost a Jill sandwich."). Based on this pattern, we can create a **regular expression (regex)** that finds and matches lines in this format and use the [`re`](https://docs.python.org/3/library/re.html) module to get those matches.

Our regex is `^[A-Z]{1}[A-Za-z '.]+:{1}(.+[\n|\r\n|\r])*`. It'll match any line with the following characteristics:
* The first part of the line must begin with a capital letter, followed by one or more letters, spaces, apostrophes, or periods
* The second part may contain any number of consecutive non-empty lines of text, each ending in a newline character, so dialogue that wraps onto the following lines is captured along with its first line
* Both parts of the sentence must be separated by exactly one colon

As an unintended consequence, this method will also capture lines that are formatted like dialogue but are not actually lines of dialogue. For example, the line "To: Emily" would be captured by the above regex, even though the line represents an email or letter header. Such mistakes should be captured and removed during the data validation process.
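To make the matching behavior concrete, the sketch below runs the same pattern against a short made-up snippet. The snippet is an assumption written for illustration and is not copied from any of the transcripts.

```python
import re

# Same pattern as above, written as a regular (non-raw) string so that
# \n and \r are literal newline and carriage return characters.
PATTERN = "^[A-Z]{1}[A-Za-z '.]+:{1}(.+[\n|\r\n|\r])*"

# Made-up sample text in the SPEAKER: LINE format
sample = (
    "Barry: That was too close.\n"
    "You were almost a Jill sandwich.\n"
    "\n"
    "To: Emily\n"
    "\n"
    "jill: lowercase speaker names are not matched\n"
)

matches = [m.group(0).rstrip()
           for m in re.finditer(PATTERN, sample, re.MULTILINE)]
for match in matches:
    print(repr(match))
```

Barry's wrapped line comes back as a single match, the "To: Emily" header is captured as a false positive, and the lowercase `jill` line is skipped.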

To save time, we're going to perform preliminary data cleanup while we attempt to extract the lines of dialogue. If we find any lines that match the `SPEAKER: LINE` format, we are going to:
* Remove extraneous newline and carriage return characters (`\n`, `\r`, and `\r\n`) and spaces
* Remove extraneous punctuation (e.g. extra quotation marks around lines of dialogue that have been added in by transcribers for stylistic reasons)
* Standardize the spelling of character names such that every part of the name is capitalized (e.g. "JILL" becomes "Jill" and "Woman in red" becomes "Woman In Red")

Our data will be saved into new .txt files.
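Before reading the full cell, it may help to see what these cleanup steps do to a few strings. The sketch below uses Python's built-in `str.title()` for capitalization; the whitespace step uses `split()` and `join()`, which is a swapped-in shortcut for illustration rather than the chained `replace()` calls used in the notebook's own functions. The sample strings are made up.

```python
# str.title() capitalizes the first letter of every run of letters and
# lowercases the rest, which also affects letters after an apostrophe.
names = ['JILL', 'Woman in red', 'Captain DeChant', "Rebecca's Voice"]
titled = [name.title() for name in names]
print(titled)

# Collapse newlines, carriage returns, and repeated spaces into single spaces
wrapped = 'A wrapped  line\r\nof dialogue'
collapsed = ' '.join(wrapped.split())
print(collapsed)
```

Names with internal capitals or apostrophes are the ones most likely to need a manual correction during validation.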

In [34]:
import re # Allows us to capture text using regular expressions (regex)

def get_speaker(line: str) -> str:
    """Returns the name of the person who spoke a line of dialogue."""
    return line[:line.find(':')].strip()

def get_speech(line: str) -> str:
    """Returns a line of dialogue."""
    return line[line.find(':') + 2:].strip() # (+2) to skip the colon and space

def replace_newline_characters(line: str) -> str:
    """Replaces all newline characters with spaces."""
    # Replace '\r\n' first so that a Windows line ending becomes one space
    return line.replace('\r\n', ' ').replace('\n', ' ').replace('\r', ' ')

def replace_typewriter_space(line: str) -> str:
    """Replaces two spaces with a single space.
    The convention supposedly comes from the era of typewriters, where typists
    would put two spaces after the end of a sentence instead of a single
    space.
    """
    return line.replace('  ', ' ')

def replace_quotation_marks(line: str) -> str:
    """Removes double quotation marks."""
    return line.replace('"', '')

def capitalize_speaker(line: str) -> str:
    """Standardizes speaker name formats by capitalizing names.
    Note that this function may fail to properly capitalize certain
    names and titles.
    
    Examples:
    * "SHEVA ALOMAR" becomes "Sheva Alomar"
    * "Captain DeChant" becomes "Captain Dechant"
    * "Woman in red" becomes "Woman In Red"
    """
    return get_speaker(line).title() + ': ' + get_speech(line) + '\n'

def get_cleaned_line(line: str) -> str:
    """Standardizes line formatting.
    Standardization consists of replacing spaces and carriage return characters,
    followed by capitalizing speakers' names.
    """
    line = replace_newline_characters(line)
    line = replace_typewriter_space(line)
    line = replace_quotation_marks(line)
    line = capitalize_speaker(line)
    return line 

def get_dialogue(input: str) -> list:
    """Returns a list of lines representing direct game dialogue."""
    dialogue = []
    with open(input, 'r', encoding = 'utf-8') as file:
        for match in re.finditer("^[A-Z]{1}[A-Za-z '.]+:{1}(.+[\n|\r\n|\r])*", 
                                 file.read(), re.MULTILINE):
            line = get_cleaned_line(match.group(0).rstrip())
            if not get_speech(line): # Skip lines with a speaker but no speech
                continue
            dialogue.append(line)
    return dialogue

def save_dialogue(output: str, dialogue: list) -> None:
    """Saves extracted dialogue to a new .txt file.
    Note that if a file with the same name exists in the current working
    directory, that file will be replaced.

    Keyword Arguments:
    * output -- The path where this file should be saved.
    * dialogue -- A list of extracted and pre-cleaned lines.
    """
    with open(output, 'w', encoding = 'utf-8', newline = '\n') as file:
        for line in dialogue:
            file.write(line)

def extract_dialogue(input: str, output: str) -> None:
    """Extracts dialogue from a .txt file and saves results to a new .txt file.
    
    Keyword Arguments:
    * input -- The path containing the original transcript
    * output -- The path where the generated .txt file should be stored
    """
    save_dialogue(output, get_dialogue(input))

files = ['RE0_Transcript.txt',
         'RE1_Transcript.txt',
         'RE2_Transcript.txt',
         'RE2_Remake_Transcript.txt',
         'RE3_Transcript.txt',
         'RE3_Remake_Transcript.txt',
         'RE4_Transcript.txt',
         'RE4_Remake_Transcript.txt',
         'RE5_Transcript.txt',
         'RE6_Transcript.txt',
         'RE7_Transcript.txt',
         'RE8_Transcript.txt', # Transcript for Resident Evil Village
         'REV_Transcript.txt'] # Transcript for Resident Evil Code: Veronica

for file in files:
    extract_dialogue(get_path_to_file(file), 
                     get_path_to_file(file.split('.')[0] + '_Extracted.txt'))
    
# Running this cell should generate 13 .txt files in your working directory.

## **Step 2: Validating the Data**
After extracting the dialogue from the transcripts, we can open these files and check them for obvious errors.<br>
In the context of this project, "obvious" errors would refer to clearly mis-captured lines, such as improperly separated or missing dialogues.

To perform these checks, we cross-referenced the dialogue we extracted against dialogue found in other wikis and videos. The transcript for *Resident Evil 0*, for instance, was cross-referenced against the [*Resident Evil Wiki*'s *Resident Evil 0* transcripts](https://residentevil.fandom.com/wiki/Template:Resident_Evil_0_cutscenes).
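As a complement to manual cross-referencing, a quick tally of speakers can point to lines worth a closer look, since mis-captured headers tend to show up as "speakers" with very few lines. This is a minimal sketch under that assumption; `count_speakers` is a hypothetical helper and the sample lines are made up.

```python
from collections import Counter

def count_speakers(lines: list) -> Counter:
    """Counts how many extracted lines belong to each speaker."""
    return Counter(line[:line.find(':')].strip() for line in lines)

# Made-up extracted lines for illustration
lines = ['Barry: That was too close.\n',
         'Jill: Barry!\n',
         'Barry: Jill, here.\n',
         'To: Emily\n']

counts = count_speakers(lines)
# Speakers with a single line are candidates for a manual check,
# not confirmed errors.
rare = sorted(name for name, total in counts.items() if total == 1)
print(rare)
```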

An interesting error we noticed was a large amount of missing dialogue for *Resident Evil 6*, where the transcriber only managed to capture dialogue for two of the four playable campaigns. To create a more complete dataset, we had to copy and paste lines from the [*Resident Evil Wiki*](https://residentevil.fandom.com/wiki/Resident_Evil_Wiki) (which could arguably be considered more “accurate” in the sense that the *Resident Evil Wiki*’s transcriptions are derived from Capcom’s [*BIOHAZARD 6 STORY GUIDE*](https://residentevil.fandom.com/wiki/BIOHAZARD_6_STORY_GUIDE), a book that contains the original Japanese script for *Resident Evil 6*). 

## **Step 3: Enriching and Cleaning the Data**
We can now try to convert the contents of our processed .txt files into pandas.DataFrames for easier data reorganization and analyses. This requires us to install the [`pandas`](https://pandas.pydata.org/) library.

In [72]:
import pandas

# Copying functions for clarity
# These functions were defined in Step 1
def get_speaker(line: str) -> str:
    """Returns the name of the person who spoke a line of dialogue."""
    return line[:line.find(':')].strip()

def get_speech(line: str) -> str:
    """Returns a line of dialogue."""
    return line[line.find(':') + 2:].strip() # (+2) to skip the colon and space

def get_dataframe(input: str) -> pandas.DataFrame:
    """Creates a dataframe from a transcript.
    
    Keyword Arguments:
    input -- The path to the file containing the extracted dialogue.
    """
    data = []
    with open(input, 'r', encoding = 'utf-8') as file:
        for line in file:
            data.append([get_speaker(line), get_speech(line)])
        return pandas.DataFrame(data, columns = ['Character', 'Line'])
    
# Here is an example of what the code does using the Resident Evil 0 transcript
re0_data = get_dataframe(get_path_to_file('RE0_Transcript_Extracted.txt'))
re0_data

| | Character | Line |
|---|---|---|
| 0 | Narrator | A small mid-western town in America: Raccoon C... |
| 1 | Man 1 | Really? |
| 2 | Woman | Hmm, do you think so too? |
| 3 | Man 2 | Yeah. |
| 4 | Man 3 | ...do about it? |
| ... | ... | ... |
| 228 | Billy | Rebecca, hurry! |
| 229 | Rebecca | Hey that must be the old mansion Enrico was ta... |
| 230 | Rebecca | I guess it's time to say goodbye. Officially, ... |
| 231 | Billy | Yeah, I'm just a zombie now. |


To make data analysis easier, we might want to consolidate similar pieces of information or add pieces of information to the existing dataset.

For example, one change we might want to make to our dataset is standardizing names within the dataset. If you've looked through the *Resident Evil 0* transcript, you might have noticed that characters are sometimes referenced with different aliases. Rebecca, for instance, has lines under both `Rebecca` and `Rebecca's Voice`. It would be easier to do data analysis if all the lines were grouped under the name `Rebecca`.<br>
We can see all the characters contained within the *Resident Evil 0* dataset by calling the `unique()` method on the `Character` column.

In [73]:
re0_data['Character'].unique()

array(['Narrator', 'Man 1', 'Woman', 'Man 2', 'Man 3', 'Man 4', 'Man 5',
       'Man 6', 'Rebecca', 'Enrico', 'Edward', 'Billy', 'Soldier',
       'Wesker', 'Birkin', 'Voice', 'Leech', 'Commander', 'Computer',
       "Rebecca's Voice", 'Marcus', 'Computer Voice'], dtype=object)

We can then replace related character names using a dictionary.

In [74]:
aliases = {"Computer Voice": "Computer",
           "Narrator's Voice": "Narrator",
           "Rebecca's Voice": "Rebecca"}

re0_data['Character'] = re0_data['Character'].replace(aliases)
re0_data['Character'].unique()

array(['Narrator', 'Man 1', 'Woman', 'Man 2', 'Man 3', 'Man 4', 'Man 5',
       'Man 6', 'Rebecca', 'Enrico', 'Edward', 'Billy', 'Soldier',
       'Wesker', 'Birkin', 'Voice', 'Leech', 'Commander', 'Computer',
       'Marcus'], dtype=object)

Because we are also interested in tracking gender representation, we may also want to assign each character a gender.

In [75]:
re0_genders = {'Man 1': 'Male',
               'Woman': 'Female',
               'Man 2': 'Male',
               'Man 3': 'Male',
               'Man 4': 'Male',
               'Man 5': 'Male',
               'Man 6': 'Male',
               'Rebecca': 'Female',
               'Enrico': 'Male',
               'Edward': 'Male',
               'Billy': 'Male',
               'Wesker': 'Male',
               'Birkin': 'Male',
               'Marcus': 'Male'}

re0_data['Gender'] = re0_data['Character'].map(re0_genders).fillna(pandas.NA)
re0_data.insert(1, 'Gender', re0_data.pop('Gender')) # Shift the order of columns
re0_data

| | Character | Gender | Line |
|---|---|---|---|
| 0 | Narrator | | A small mid-western town in America: Raccoon C... |
| 1 | Man 1 | Male | Really? |
| 2 | Woman | Female | Hmm, do you think so too? |
| 3 | Man 2 | Male | Yeah. |
| 4 | Man 3 | Male | ...do about it? |
| ... | ... | ... | ... |
| 228 | Billy | Male | Rebecca, hurry! |
| 229 | Rebecca | Female | Hey that must be the old mansion Enrico was ta... |
| 230 | Rebecca | Female | I guess it's time to say goodbye. Officially, ... |
| 231 | Billy | Male | Yeah, I'm just a zombie now. |


We may also want to assign each character a role. In this case, we'll classify the characters into playable and non-playable categories.
* A **playable** character is a character that can be controlled **from the start of the game** (this excludes characters like **Wesker** and **Rose**, who are only playable in specific segments of the game or in separate DLCs)
* A **non-playable** character is a character that cannot be controlled during normal gameplay

### Why would you want to track the roles?
In video games, playable characters typically have more lines of dialogue than non-playable characters. This is because playable characters take up the majority of the game's screentime; these characters not only drive the action and story forward but also happen to be the characters the players spend the most time with.

In other words, playable characters skew dialogue data. Games with more playable characters are more likely to have more dialogue than games with fewer playable characters.

### But if that's the case, why not classify characters by narrative role? Don't villains also contribute to dialogue count?
Role assignment is subjective. Our perception of a character's importance is often influenced by the amount of time we spend with them. For example, some players might view Sheva as a protagonist in *Resident Evil 5* because she is a controllable character and a persistent ally. But *Resident Evil 5*'s plot generally focuses on Chris's past; Sheva exists to guide and assist Chris. Sheva thus could be considered a deuteragonist.

The same idea applies to villains. To give another example, Annette Birkin in *Resident Evil 2* and its remake could be considered a minor antagonist or supporting character depending on how players interpret Annette's actions and whether players hold Annette responsible for the events of the game. 

### Why are we skipping certain playable characters? Isn't the definition of playable "controllable"?
While the term "playable" does refer to any controllable character, characters who are controllable for brief periods of time typically have minimal impact on the game and the plot. Wesker, for instance, is technically a playable character but is only controllable in a special mode in certain games.

We also feel that using the colloquial definition of "playable character" -- a character who has significant screentime and is playable in the main campaign -- would result in a more meaningful data analysis.

In [76]:
re0_playable_characters = {'Rebecca': 'Playable',
                           'Billy': 'Playable'}

re0_data['Playable'] = re0_data['Character'].map(re0_playable_characters) \
                       .fillna('Non-Playable')
re0_data.insert(2, 'Playable', re0_data.pop('Playable')) # Shift the order of columns
re0_data

| | Character | Gender | Playable | Line |
|---|---|---|---|---|
| 0 | Narrator | | Non-Playable | A small mid-western town in America: Raccoon C... |
| 1 | Man 1 | Male | Non-Playable | Really? |
| 2 | Woman | Female | Non-Playable | Hmm, do you think so too? |
| 3 | Man 2 | Male | Non-Playable | Yeah. |
| 4 | Man 3 | Male | Non-Playable | ...do about it? |
| ... | ... | ... | ... | ... |
| 228 | Billy | Male | Playable | Rebecca, hurry! |
| 229 | Rebecca | Female | Playable | Hey that must be the old mansion Enrico was ta... |
| 230 | Rebecca | Female | Playable | I guess it's time to say goodbye. Officially, ... |
| 231 | Billy | Male | Playable | Yeah, I'm just a zombie now. |
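With both the `Gender` and `Playable` columns in place, one way to see how much playable characters contribute to the dialogue is to count lines for each combination of role and gender. The sketch below uses a small made-up dataframe with the same column layout, so the numbers it prints are not results from the actual transcripts.

```python
import pandas

# Made-up rows that follow the Character / Gender / Playable / Line layout
sample = pandas.DataFrame({
    'Character': ['Rebecca', 'Billy', 'Billy', 'Enrico'],
    'Gender': ['Female', 'Male', 'Male', 'Male'],
    'Playable': ['Playable', 'Playable', 'Playable', 'Non-Playable'],
    'Line': ['Who are you?', 'Hold it.', 'Fine.', 'Rebecca!'],
})

# Count the number of lines for each combination of role and gender
line_counts = sample.groupby(['Playable', 'Gender']).size()
print(line_counts)
```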


We can save our data to a `.csv` file using `pandas`.

In [77]:
re0_data.to_csv(get_path_to_file('RE0_Transcript.csv'), encoding = 'utf-8', index = False)

We then have to repeat this process for all the other transcripts.

In [78]:
files = ['RE1_Transcript_Extracted.txt',
         'RE2_Transcript_Extracted.txt',
         'RE2_Remake_Transcript_Extracted.txt',
         'RE3_Transcript_Extracted.txt',
         'RE3_Remake_Transcript_Extracted.txt',
         'RE4_Transcript_Extracted.txt',
         'RE4_Remake_Transcript_Extracted.txt',
         'RE5_Transcript_Extracted.txt',
         'RE6_Transcript_Extracted.txt',
         'RE7_Transcript_Extracted.txt',
         'RE8_Transcript_Extracted.txt',
         'REV_Transcript_Extracted.txt']

dataframes = []
for file in files:
    dataframes.append(get_dataframe(file))

for data in dataframes:
    print(data['Character'].unique())

['Chris' 'Jill' 'Joseph' 'Wesker' 'Barry' 'Richard' 'Brad' 'Enrico'
 'Voice' 'Rebecca']
['Narrator' 'Leon' 'Truck Driver' 'Claire' 'Kendo' 'Marvin' 'Ada' 'Ben'
 'Annette' 'Umbrella Soldier 1' 'William' 'Umbrella Soldier 2'
 'Alpha Team Leader' 'Computer Voice' 'Sherry' 'Cop' 'Pilot' 'Irons'
 'Umbrella Soldier 3']
['Caller' 'Anchor' 'Trucker' 'Leon' 'Officer' 'Claire' 'Loudspeaker'
 'Elliot' 'Marvin' 'Police Radio' 'Ada' 'Ben' 'Annette' 'Kendo' 'Emma'
 'Kirkpatrick' 'Umbrella Soldier 2' 'System' 'Umbrella Soldier 1' 'Hunk'
 'William' 'Policeman' 'Sherry' 'Irons' 'Computer']
['Jill' 'Helicopter pilot' 'Man' 'Officer 1' 'Radio' 'Soldier 1'
 'Soldier 2' 'Officer 2' 'Officer 3' 'Officer 4' 'Officer 5' 'Soldier 3'
 'Soldier 4' 'Soldier 5' 'Soldier 6' 'Dario' 'Brad' 'Nemesis' 'Carlos'
 'Nicholai' 'Mikhail' 'Murphy' 'Carlos Oliveria' 'Announcement' 'Computer'
 'Barry' 'Reporter']
['Helicopter Loudspeaker' 'Witness' 'Government Official' 'Jill' 'Brad'
 'Dario' 'Carlos' 'Mikhail' 'Guy' 'Man' 'Co

In [79]:
aliases = {"Computer Voice": "Computer",
           "Carlos Oliveria": "Carlos",
           "Nurse On Tape": "Nurse",
           "Dr. Bard On Tape": "Dr. Bard",
           "Sera": "Luis",
           "Midget": "Salazar",
           "Military In Red Beret": "Krauser",
           "Albert Wesker": "Wesker",
           "Captain DeChant": "DeChant",
           "Sheva Alomar": "Sheva",
           "Chris Redfield": "Chris",
           "Captain Josh Stone": "Josh",
           "Bird Lady": "Jill",
           "Dave Johnson": "Dave",
           "Excella Gionne": "Excella",
           "Ozwell E. Spencer": "Spencer",
           "Jill Valentine": "Jill",
           "Bravo": "Bravo Team",
           "Girl On Video": "Mia",
           "Girl From Video": "Mia",
           "Woman Voice": "Woman",
           "Father Of The Family": "Jack Baker",
           "The Mother Of The Family": "Mother Of The Family",
           "The Daughter Of The Family": "Daughter Of The Family",
           "The Son Of The Family": "Son Of The Family",
           "The Son": "Son Of The Family",
           "Girl": "Zoe",
           "Mother Miranda": "Miranda",
           "Man With Hammer": "Heisenberg",
           "Woman With A Large Hat": "Lady Dimitrescu",
           "Woman With A Circle Thing": "Miranda",
           "Fat Man": "Duke",
           "Monster Dimitrescu": "Lady Dimitrescu",
           'Monster Heisenberg': 'Heisenberg',
           "Moreau the Giant Toad": "Moreau",
           "Witch Miranda": "Miranda",
           "Chris's Voice": "Chris"
          }

for data in dataframes:
    data['Character'] = data['Character'].replace(aliases)
    print(data['Character'].unique())

['Chris' 'Jill' 'Joseph' 'Wesker' 'Barry' 'Richard' 'Brad' 'Enrico'
 'Voice' 'Rebecca']
['Narrator' 'Leon' 'Truck Driver' 'Claire' 'Kendo' 'Marvin' 'Ada' 'Ben'
 'Annette' 'Umbrella Soldier 1' 'William' 'Umbrella Soldier 2'
 'Alpha Team Leader' 'Computer' 'Sherry' 'Cop' 'Pilot' 'Irons'
 'Umbrella Soldier 3']
['Caller' 'Anchor' 'Trucker' 'Leon' 'Officer' 'Claire' 'Loudspeaker'
 'Elliot' 'Marvin' 'Police Radio' 'Ada' 'Ben' 'Annette' 'Kendo' 'Emma'
 'Kirkpatrick' 'Umbrella Soldier 2' 'System' 'Umbrella Soldier 1' 'Hunk'
 'William' 'Policeman' 'Sherry' 'Irons' 'Computer']
['Jill' 'Helicopter pilot' 'Man' 'Officer 1' 'Radio' 'Soldier 1'
 'Soldier 2' 'Officer 2' 'Officer 3' 'Officer 4' 'Officer 5' 'Soldier 3'
 'Soldier 4' 'Soldier 5' 'Soldier 6' 'Dario' 'Brad' 'Nemesis' 'Carlos'
 'Nicholai' 'Mikhail' 'Murphy' 'Announcement' 'Computer' 'Barry'
 'Reporter']
['Helicopter Loudspeaker' 'Witness' 'Government Official' 'Jill' 'Brad'
 'Dario' 'Carlos' 'Mikhail' 'Guy' 'Man' 'Computer' 'Nicholai' 'Kend

In [80]:
re_genders = {# Resident Evil 1 Characters
              'Chris': 'Male',
              'Jill': 'Female',
              'Joseph': 'Male',
              'Wesker': 'Male',
              'Barry': 'Male',
              'Richard': 'Male',
              'Brad': 'Male',
              'Enrico': 'Male',
              'Rebecca': 'Female',
              # Resident Evil 2 Characters
              'Leon': 'Male',
              'Claire': 'Female',
              'Kendo': 'Male',
              'Marvin': 'Male',
              'Ada': 'Female',
              'Ben': 'Male',
              'Annette': 'Female',
              'Emma': 'Female',
              'Kirkpatrick': 'Male',
              'Hunk': 'Male',
              'William': 'Male',
              'Sherry': 'Female',
              'Irons': 'Male',
              # Resident Evil 3 Characters
              'Man': 'Male',
              'Guy': 'Male',
              'Dario': 'Male',
              'Brad': 'Male',
              'Nemesis': 'Male',
              'Carlos': 'Male',
              'Nicholai': 'Male',
              'Mikhail': 'Male',
              'Murphy': 'Male',
              'Tyrell': 'Male',
              'Dr. Bard': 'Male',
              # Resident Evil 4 Characters
              'Hunnigan': 'Female',
              'Don Esteban': 'Male',
              'Luis': 'Male',
              'Mendez': 'Male',
              'Ashley': 'Female',
              'Saddler': 'Male',
              'Salazar': 'Male',
              'Krauser': 'Male',
              'Mike': 'Male',
              # Resident Evil 5 Characters
              'DeChant': 'Male',
              'Sheva': 'Female',
              'Kirk': 'Male',
              'Fisher': 'Male',
              'Allyson': 'Female',
              'Josh': 'Male',
              'Irving': 'Male',
              'Dave': 'Male',
              'Excella': 'Female',
              'Spencer': 'Male',
              # Resident Evil 6 Characters
              'Jake': 'Male',
              'Sherry': 'Female',
              'Piers': 'Male',
              'Helena': 'Female',
              "Liz's father": 'Male',
              'Liz': 'Female',
              'Deborah': 'Female',
              'Simmons': 'Male',
              'Finn': 'Male',
              'Marco': 'Male',
              'Ada': 'Female',
              'Carla': 'Female',
              'Boy': 'Male',
              # Resident Evil 7 Characters
              'Mia': 'Female',
              'Ethan': 'Male',
              'Pete': 'Male',
              'Andre': 'Male',
              'Jack Baker': 'Male',
              'Zoe': 'Female',
              'Mother Of The Family': 'Female',
              'Son Of The Family': 'Male',
              'Young Girl': 'Female',
              'Daughter Of The Family': 'Female',
              'Eveline': 'Female',
              'Alan': 'Male',
              'Clancy': 'Male',
              # Resident Evil Village Characters
              'Elena': 'Female',
              'Julian': 'Male',
              'Luiza': 'Female',
              'Leonardo': 'Male',
              'Anton': 'Male',
              'Roxana': 'Female',
              'Heisenberg': 'Male',
              'Lady Dimitrescu': 'Female',
              'Doll': 'Female', # May be Donna
              'Miranda': 'Female',
              'Duke': 'Male',
              'Daughters': 'Female',
              'Daughter': 'Female',
              'Second Daughter': 'Female',
              'Third Daughter': 'Female',
              'Woman In Black': 'Female',
              'Moreau': 'Male', 
              'Canine': 'Male',
              'Umber Eyes': 'Male',
              'Lobo': 'Male',
              'Tundra': 'Female',
              'Mom': 'Female',
              'Man In Black': 'Male',
              'Rose': 'Female',
              # Resident Evil Code Veronica Characters
              'Rodrigo': 'Male',
              'Steve': 'Male',
              'Alfred': 'Male',
              'Alexia': 'Female',
              # Misc Characters
              'Man 1': 'Male',
              'Man 2': 'Male',
              'Man 3': 'Male',
              'Man 4': 'Male',
              'Man 5': 'Male',
              'Man 6': 'Male',
              'Guy': 'Male',
              'Drunk Guy': 'Male',
              'Old Man': 'Male',
              'Boy': 'Male',
              'Woman': 'Female',
              'Woman 1': 'Female',
              'Woman Soldier': 'Female',
              'Old Woman': 'Female',
              'Frightened Woman': 'Female',
            }

for data in dataframes:
    data['Gender'] = data['Character'].map(re_genders).fillna(pandas.NA)
    data.insert(1, 'Gender', data.pop('Gender')) # Shift the columns

# This should show the dataframe for the first Resident Evil
dataframes[0]

| | Character | Gender | Line |
|---|---|---|---|
| 0 | Chris | Male | Alpha Team is flying around the forest zone, s... |
| 1 | Chris | Male | Bizarre murder cases have recently occured in ... |
| 2 | Jill | Female | Look, Chris! |
| 3 | Chris | Male | It was Bravo Team's helicopter. Nobody was in ... |
| 4 | Joseph | Male | Hey! Come here! |
| ... | ... | ... | ... |
| 684 | Brad | Male | Chris! Use it! Destroy the monsters with it! |
| 685 | Chris | Male | Are you tired, Rebecca? |
| 686 | Rebecca | Female | Sorry, Chris... I am. |
| 687 | Chris | Male | You did a really good job. This case was just ... |
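Since any character missing from the gender dictionary ends up with an empty `Gender` value, it can be worth listing those characters to confirm that each gap is intentional (such as `Voice` or `Computer`) rather than an oversight. The following is a minimal sketch on a made-up dataframe; the same filter could be applied to each entry in `dataframes`.

```python
import pandas

# Made-up rows; 'Voice' stands in for a character with no assigned gender
sample = pandas.DataFrame({
    'Character': ['Chris', 'Voice', 'Jill', 'Voice'],
    'Gender': ['Male', pandas.NA, 'Female', pandas.NA],
})

# Collect the distinct characters whose Gender value is missing
unmapped = sorted(sample.loc[sample['Gender'].isna(), 'Character'].unique())
print(unmapped)
```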


In [81]:
re_playable_characters = {'Chris': 'Playable',
                          'Jill': 'Playable',
                          'Leon': 'Playable',
                          'Claire': 'Playable',
                          'Ada': 'Playable',
                          'Sherry': 'Playable',
                          'Steve': 'Playable',
                          'Sheva': 'Playable',
                          'Helena': 'Playable',
                          'Piers': 'Playable',
                          'Jake': 'Playable',
                          'Ethan': 'Playable'}

for data in dataframes:
    data['Playable'] = data['Character'].map(re_playable_characters). \
                       fillna('Non-Playable')
    data.insert(2, 'Playable', data.pop('Playable')) # Shift the columns

# This should show the dataframe for the last transcript in the list (Code: Veronica)
dataframes[-1]

| | Character | Gender | Playable | Line |
|---|---|---|---|---|
| 0 | Unknown Speaker | | Non-Playable | Her name is Claire Redfield. We caught her tre... |
| 1 | Rodrigo | Male | Non-Playable | Don't Move. |
| 2 | Rodrigo | Male | Non-Playable | Perfect! Go on, get out of here, this place is... |
| 3 | Claire | Female | Playable | What are you saying? |
| 4 | Rodrigo | Male | Non-Playable | Your free to leave the complex, but you may as... |
| ... | ... | ... | ... | ... |
| 361 | Chris | Male | Playable | Hey, you know I always keep my promises. |
| 362 | Claire | Female | Playable | Chris promise me, please promise that you won'... |
| 363 | Chris | Male | Playable | I'm sorry Claire, but its not over yet, there'... |
| 364 | Claire | Female | Playable | You mean... |


In [82]:
# Save the data to a .csv file so we don't have to repeat the process
for index, data in enumerate(dataframes):
    output = get_path_to_file(files[index].rsplit('_', 1)[0] + '.csv')
    data.to_csv(output, encoding = 'utf-8-sig', index = False) # UTF-8-BOM