# MCU Scripts

## Configuration

In [1]:
%load_ext autotime

In [2]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

import os

In [3]:
import re

## Analysis

### About the Data

#### Processed Scripts

In [4]:
# Load the cleaned dialogue table and keep only the columns used below.
lines = pd.read_csv('../clean_data/mcu_data.csv', index_col=0).reset_index(drop=True)[['character', 'line', 'movie', 'year', 'words']]
print('Entries: ', len(lines))
lines.head()
# One row per (movie, year) pair that already has processed dialogue.
processed_movies = lines.groupby(['movie', 'year']).head(1)[['movie', 'year']].sort_values(['movie']).reset_index(drop=True)
processed_movies

Entries:  6509


Unnamed: 0,movie,year
0,Ant-Man,2015
1,Avengers: Age of Ultron,2015
2,Avengers: Endgame,2019
3,Avengers: Infinity War,2018
4,Captain America: Civil War,2016
5,Captain America: The First Avenger,2011
6,Captain America: The Winter Soldier,2014
7,Captain Marvel,2019
8,Iron Man,2008
9,Iron Man 2,2010


#### Raw Scripts

In [5]:
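# Left-join the raw transcripts onto the processed movie list; titles with no processed data get NaN for movie and year.
df = pd.read_csv('../raw_data/mcu_scipts.csv', index_col=0).merge(processed_movies, right_on=['movie'], left_on=['title'], how='left')
# Transcript length in characters.
df['script_length'] = df['script'].apply(len)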
print('Entries: ', len(df))
df.head(23)

Entries:  23


Unnamed: 0,title,script,movie,year,script_length
0,Ant-Man,Previous transcript:\n Next transcript:\n\n\n ...,Ant-Man,2015.0,82517
1,Ant-Man and the Wasp,This transcript is not finished!This page does...,,,67957
2,The Avengers,This transcript isn't tidy!This page's transcr...,The Avengers,2012.0,163543
3,Avengers: Age of Ultron,Previous transcript:\n Next transcript:\n\n\n ...,Avengers: Age of Ultron,2015.0,91399
4,Avengers: Endgame,Previous transcript:\n Next transcript:\n\n\n ...,Avengers: Endgame,2019.0,138200
5,Avengers: Infinity War,Previous transcript:\n Next transcript:\n\n\n ...,Avengers: Infinity War,2018.0,141191
6,Black Panther,This transcript isn't tidy!This page's transcr...,,,201332
7,Captain America: Civil War,Previous transcript:\n Next transcript:\n\n\n ...,Captain America: Civil War,2016.0,127046
8,Captain America: The First Avenger,Previous transcript:\n Next transcript:\n\n\n ...,Captain America: The First Avenger,2011.0,71770
9,Captain America: The Winter Soldier,Previous transcript:\n Next transcript:\n\n\n ...,Captain America: The Winter Soldier,2014.0,98173


In [6]:
script = df.script.values[11]
print(script[:1000])

This transcript isn't tidy!This page's transcript is incomplete for the following reason(s):unfixed/messedRemove this template once any and all issues are resolved.

This article is a stub. You can help Transcripts Wiki by  expanding it.



 Previous transcript:
 Next transcript:


 Captain America: Civil War
 Guardians of the Galaxy Vol. 2

     

[scene at the temple, sound of bell ringing. Some people are walking around out of the temple’s library. Showing the librarian are putting back a book to its shelves. the leader of ‘the people’ are showing off with a hood, making the librarian pay attention. two of them follow by, walking through the librarian with their leader. All of them showing off. two of them making spell that hold the librarian’s two arms, and two others making spell by a stick that hold his two legs. making him lifted. the librarian grimacing in pain. someone put a jug below his head. the leader walking closely to the librarian. the leader took off his hood. the libr

# Cleaning Up Script

## Find Actual Start of Script

In [7]:
script[:351]

"This transcript isn't tidy!This page's transcript is incomplete for the following reason(s):unfixed/messedRemove this template once any and all issues are resolved.\n\nThis article is a stub. You can help Transcripts Wiki by  expanding it.\n\n\n\n Previous transcript:\n Next transcript:\n\n\n Captain America: Civil War\n Guardians of the Galaxy Vol. 2\n\n     \n\n"

In [8]:
# Drop the 351-character wiki header shown in the previous cell.
script = script[351:]
print(script[:1000])

[scene at the temple, sound of bell ringing. Some people are walking around out of the temple’s library. Showing the librarian are putting back a book to its shelves. the leader of ‘the people’ are showing off with a hood, making the librarian pay attention. two of them follow by, walking through the librarian with their leader. All of them showing off. two of them making spell that hold the librarian’s two arms, and two others making spell by a stick that hold his two legs. making him lifted. the librarian grimacing in pain. someone put a jug below his head. the leader walking closely to the librarian. the leader took off his hood. the librarian has a look at the leader. the leader places his hands onto his back and holding a pair of blades as he chops the librarian’s head off which falls into the jug. the leader takes the book that had been placed by the librarian. he opens the book searching for a page then rips it from the book, and throws the book away. he walks from his place and
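
The offset 351 was found by inspection and only fits this one transcript. A minimal sketch of locating the start automatically is shown here, assuming the body begins at the first line that opens with `[`, as it does in this transcript. `find_script_start` is a hypothetical helper and is not part of the notebook; other transcripts may need a different rule.

```python
import re

# Hypothetical helper: the notebook itself slices at a hand-picked offset (351).
# Assumption: the script body starts at the first line that opens with '['.
SCENE_START = re.compile(r'^[ \t]*\[', flags=re.MULTILINE)


def find_script_start(text):
    """Return the index of the first line that opens with '[', or 0 if none."""
    match = SCENE_START.search(text)
    return match.start() if match else 0


# Shortened stand-in for the wiki header followed by the first script lines.
sample = (
    "This transcript isn't tidy!\n\n"
    " Previous transcript:\n Next transcript:\n\n"
    "[scene at the temple, sound of bell ringing.]\n"
    "Kaecilius: Hypocrite!\n"
)

start = find_script_start(sample)
body = sample[start:]
print(body.splitlines()[0])
```

Falling back to 0 keeps the whole text when no bracketed line is found, so nothing is cut by mistake.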

## Find Characters

In [9]:
script[:10000]

'[scene at the temple, sound of bell ringing. Some people are walking around out of the temple’s library. Showing the librarian are putting back a book to its shelves. the leader of ‘the people’ are showing off with a hood, making the librarian pay attention. two of them follow by, walking through the librarian with their leader. All of them showing off. two of them making spell that hold the librarian’s two arms, and two others making spell by a stick that hold his two legs. making him lifted. the librarian grimacing in pain. someone put a jug below his head. the leader walking closely to the librarian. the leader took off his hood. the librarian has a look at the leader. the leader places his hands onto his back and holding a pair of blades as he chops the librarian’s head off which falls into the jug. the leader takes the book that had been placed by the librarian. he opens the book searching for a page then rips it from the book, and throws the book away. he walks from his place an

In [24]:
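# Speaker labels: one or more capitalized words (optionally ending in '.') at the start of a line, followed by ':'.
characters = np.unique(re.findall(string=script, pattern=r'\n((?:[A-Z][a-z]+\.? ?)+):'), return_counts=True)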
characters = pd.DataFrame(zip(characters[0], characters[1]), columns=['character', 'line_count'])
print(len(characters))
characters.sort_values(['line_count'], ascending=False)

26


Unnamed: 0,character,line_count
11,Dr. Stephen Strange,191
22,The Ancient One,67
7,Doctor Strange,59
19,Mordo,48
2,Christine Palmer,35
1,Christine,30
16,Kaecilius,28
24,Wong,19
9,Dormammu,12
0,Billy,7
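
The pattern ties each speaker label to the start of a line and only accepts capitalized words. A small sketch of what that means in practice, using `^` with `re.MULTILINE` in place of the leading `\n` (an assumption of this sketch) and a made-up three-line sample built from names in this transcript:

```python
import re

# Same label shape as the notebook pattern; '^' with re.MULTILINE stands in
# for the leading '\n' (an assumption made for this sketch).
LABEL = r'((?:[A-Z][a-z]+\.? ?)+):'
anchored = re.compile('^' + LABEL, flags=re.MULTILINE)
unanchored = re.compile(LABEL)

# Made-up sample lines for illustration.
sample = (
    "Dr. Stephen Strange: Check again.\n"
    "The Ancient One: Master Kaecilius.\n"
    "Doctor that is helping Stephen: I hate you.\n"
)

# Anchored: a label containing lowercase words is skipped entirely.
print(anchored.findall(sample))
# Unanchored: the same line is credited to the last capitalized word before ':'.
print(unanchored.findall(sample))
```

Without the anchor, a descriptive label such as the third line is attributed to a shorter name, so the two variants can report different character lists for the same text.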


## Remove Narration

In [25]:
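# Remove bracketed stage directions, plus the newline that follows them.
script_lines_only = re.sub(string=script, pattern=r'(\[.*\])\n?', repl='')
# Remove whole lines that open with '[' or '(' even when the closing bracket is missing.
script_lines_only = re.sub(string=script_lines_only, pattern=r'\n(\[.*\]?)\n', repl='\n')
script_lines_only = re.sub(string=script_lines_only, pattern=r'\n(\(.*\)?)\n', repl='\n')
# Replace non-breaking spaces with regular spaces.
script_lines_only = script_lines_only.replace('\xa0', ' ')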

print(script_lines_only[:1000])

The Ancient One: Master Kaecilius. That ritual will bring you only sorrow
Kaecilius: Hypocrite!




— At a hospital —
Doctor Strange: Challenge round, Billy.
Doctor Strange: Oh, come on, Billy. You’ve got to be messing with me.
Billy: Heheh. No, doctor.
Doctor Strange: Feels So Good, Chuck Mangione, 1977. Seriously, Billy, you said this one would be hard.
Billy: Hah! It’s 1978.
Doctor Strange: No, Billy, while Feels So Good may have charted in 1978, the album was released in December, 1977.
Billy: No, no. Wikipedia says the…
Doctor Strange: Check again.
Billy: When did you…?
Doctor that is helping Stephen: Where do you store all this useless information?
Doctor Strange: Useless? The man charted a top ten hit with a Flugelhorn. Status, Billy?
Dr. Billy: 1977.
Doctor that is helping Stephen: Oh! Please. I hate you.
Doctor Strange: Woah! "Feels so good", doesn’t it?
Doctor Strange: Oh, I…
Doctor that is helping Stephen: I’ve got this, Stephen. You’ve done your bit. Go ahead, we’ll close u
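
One caveat with the first substitution: `.*` is greedy, so on a line holding two bracketed directions it removes everything between the first `[` and the last `]`, dialogue included. Whether such lines occur in this transcript was not checked; the sample below is made up. A sketch of a bounded alternative, assuming directions never contain a `]` themselves:

```python
import re

# Made-up line with two bracketed directions around a piece of dialogue.
sample = "[door opens] Wong: Strange. [pause] Wong: Now.\n"

# Greedy: spans from the first '[' to the last ']' on the line.
greedy = re.sub(r'\[.*\]', '', sample)

# Bounded: each match stops at the first ']' and never crosses a newline.
bounded = re.sub(r'\[[^\]\n]*\]', '', sample)

print(repr(greedy))
print(repr(bounded))
```

The bounded form keeps the dialogue between the two directions; leftover double spaces can be collapsed in a later step if needed.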

## Character Line Extraction

In [27]:
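# Capture (speaker, dialogue) pairs; speaker words may end in '-' or '.'.
lines = re.findall(string=script_lines_only, pattern=r'((?:[A-Z][a-z]+[-.]? ?)+):\s*(.*)')
lines = pd.DataFrame(lines, columns=['character', 'line'])
# Trim leading and trailing whitespace from each line of dialogue.
lines['line'] = lines['line'].str.strip()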
lines['character'].value_counts()

Dr. Stephen Strange       191
The Ancient One            67
Doctor Strange             58
Mordo                      48
Christine Palmer           33
Christine                  30
Kaecilius                  28
Wong                       19
Dormammu                   12
Billy                       7
Dr. West                    7
Pangborn                    7
Thor                        6
Young Doctor                5
Etienne                     4
Jonathan                    4
Karl Mordo                  4
Stephen                     3
Doctor Two                  1
Doctor Stephen Strange      1
Doctor One                  1
Sign                        1
Doctor                      1
Crhstine Palmer             1
Master                      1
Strange                     1
Dr. Billy                   1
Dr. Strange                 1
Name: character, dtype: int64
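
Several labels in this output look like variants of the same character (for example `Doctor Strange` and `Dr. Stephen Strange`, or the misspelled `Crhstine Palmer`). A minimal sketch of merging them follows. The alias map is an assumption made for illustration, not something the notebook defines, and each entry should be checked against the transcript before use.

```python
from collections import Counter

# Assumed alias map (hypothetical): variant label -> canonical label.
ALIASES = {
    'Doctor Strange': 'Dr. Stephen Strange',
    'Doctor Stephen Strange': 'Dr. Stephen Strange',
    'Dr. Strange': 'Dr. Stephen Strange',
    'Strange': 'Dr. Stephen Strange',
    'Stephen': 'Dr. Stephen Strange',
    'Christine': 'Christine Palmer',
    'Crhstine Palmer': 'Christine Palmer',
    'Karl Mordo': 'Mordo',
}

# Subset of the line counts printed above.
counts = {
    'Dr. Stephen Strange': 191, 'Doctor Strange': 58,
    'Doctor Stephen Strange': 1, 'Dr. Strange': 1, 'Strange': 1, 'Stephen': 3,
    'Christine Palmer': 33, 'Christine': 30, 'Crhstine Palmer': 1,
    'Mordo': 48, 'Karl Mordo': 4, 'Wong': 19,
}

# Sum the counts under each canonical label; unmapped labels keep their name.
merged = Counter()
for name, n in counts.items():
    merged[ALIASES.get(name, name)] += n

print(merged.most_common())
```

In the notebook, the same mapping could be applied to the DataFrame with `lines['character'].replace(ALIASES)` before calling `value_counts()`.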