In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20,10)

In [None]:
df = pd.read_csv('../data/avatar.csv', encoding = 'unicode_escape').drop(columns=['Unnamed: 0', 'id'])
df.head()

In [None]:
speakers = df.groupby(['character']).size().sort_values(ascending=False)

In [None]:
plt.bar(x=speakers.index[:20], height=speakers.values[:20] / speakers.values.sum() * 100)
plt.xticks(rotation=-45)
plt.title('20 most common speakers')
plt.ylabel('Number of quotes (%)')
plt.xlabel('Speaker name')
plt.show()

# Preprocessing
The plot above shows that there is a character named "Scene Description". Scene descriptions are useless for text (speech) generation, so we check what share of the whole dataset they make up.

In [None]:
total = df.shape[0]
characters = df[df['character'] != 'Scene Description'].shape[0]
descriptions = df[df['character'] == 'Scene Description'].shape[0]
print(f"{'Total number of expressions:':<30}{total:<10}")
print(f"{'Characters statements:':<30}{characters:<6}( {characters/total * 100:.2f}% )")
print(f"{'Scene descriptions:':<30}{descriptions:<6}( {descriptions/total * 100:.2f}% )")


There are also some troublesome characters in the file:

| occurrences   | character         |
|---------------|-------------------|
| 1766          | Aang              |
| 2             | Aang and Sokka    |
| 1             | Aang and Zuko     |
| 1             | Aang:             |
| 1             | Actor Bumi        |
| 5             | Actor Iroh        |
| 2             | Actor Jet         |
| 5             | Actor Ozai        |
| 16            | Actor Sokka       |
| 3             | Actor Toph        |
| 14            | Actor Zuko        |
| 19            | Actress Aang      |
| 10            | Actress Azula     |
| 16            | Actress Katara    |

Hence, we preprocess the data by executing the following steps in order:
1. Drop scene descriptions.
2. Drop statements spoken by more than one character. There is no simple way to assign them to the proper character, and since there are only a few of them, we drop them rather than label them manually.
3. Lowercase all character names.
4. Remove tokens like ":", "actor", "actress" from character names.
5. Strip surrounding whitespace and convert names to title case.
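
The steps above can be sketched on a handful of sample names before running them on the full dataset. This is a minimal illustration, not the notebook's exact cell: the helper `clean_characters` and the name "Sandbender" are hypothetical, the latter added only to show why matching "and" as a whole word (`\band\b`) is safer than a plain substring match. The last step uses `str.title()`, as the notebook's code does.

```python
import pandas as pd

# Toy sample; "Sandbender" is a hypothetical name used to test the "and" filter.
names = pd.Series([
    "Aang", "Aang and Sokka", "Aang:", "Actor Iroh",
    "Actress Katara", "Scene Description", "Sandbender",
])

def clean_characters(characters: pd.Series) -> pd.Series:
    """Apply the five preprocessing steps to a Series of speaker names."""
    kept = characters[characters != "Scene Description"]                  # step 1
    kept = kept[~kept.str.contains(r"\band\b", regex=True)]               # step 2
    kept = kept.str.lower()                                               # step 3
    kept = kept.str.replace(r":|\bactor\b|\bactress\b", "", regex=True)   # step 4
    return kept.str.strip().str.title().reset_index(drop=True)            # step 5

cleaned = clean_characters(names)
print(cleaned.tolist())
```

A plain `str.contains('and')` would also drop "Sandbender", because the substring "and" occurs inside the name.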


In [None]:
df = pd.read_csv('../data/avatar.csv', encoding = 'unicode_escape').drop(columns=['Unnamed: 0', 'id'])

In [None]:
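# 1. Drop scene descriptions.
df = df[df['character'] != 'Scene Description']
# 2. Drop joint statements. \b matches "and" only as a whole word,
#    so names that merely contain the letters "and" are kept.
df = df[~df['character'].str.contains(r'\band\b', regex=True)].copy()
# 3. Lowercase the names.
df['character'] = df['character'].str.lower()
# 4. Remove ":", "actor" and "actress" tokens.
df['character'] = df['character'].str.replace(r':|\bactor\b|\bactress\b', '', regex=True)
# 5. Strip whitespace and convert to title case.
df['character'] = df['character'].str.strip().str.title()
df = df.reset_index(drop=True)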

In [None]:
speakers = df.groupby(['character']).size().sort_values(ascending=False)

plt.bar(x=speakers.index[:20], height=speakers.values[:20] / speakers.values.sum() * 100)
plt.xticks(rotation=-45)
plt.title('20 most common speakers')
plt.ylabel('Number of quotes (%)')
plt.xlabel('Speaker name')
plt.show()

# Exploratory Data Analysis
## English in ATLA and in the ordinary world
We want to verify whether general English and the English spoken in ATLA (Avatar: The Last Airbender) are similar. To this end, we compare the most frequent words in both.

In [None]:
import requests
import nltk

nltk.download('wordnet')  # corpus used by WordNetLemmatizer
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

In [None]:
lemmatizer = nltk.stem.WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return nltk.corpus.wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return nltk.corpus.wordnet.VERB
    elif treebank_tag.startswith('N'):
        return nltk.corpus.wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return nltk.corpus.wordnet.ADV
    else:
        return nltk.corpus.wordnet.NOUN

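def lemmatize(word):
    """Lemmatize a single word using its POS tag (tagged in isolation, without sentence context)."""
    return lemmatizer.lemmatize(word, get_wordnet_pos(nltk.pos_tag([word])[0][1]))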

In [None]:
# Lowercase, strip punctuation, split into words and lemmatize each word.
words = (df['character_words'].str.lower()
         .str.replace(r'[^\w\s]', '', regex=True)
         .str.split(expand=True).stack().apply(lemmatize))
# value_counts sorts in descending order, so the index holds the most frequent lemmas first.
freq_atla = words.value_counts().index[:20].tolist()
freq_eng = requests.get('https://raw.githubusercontent.com/pkLazer/password_rank/master/4000-most-common-english-words-csv.csv').text.splitlines()[1:101]

In [None]:
print(f'Differences: {len(set(freq_atla).difference(freq_eng[:20]))}')
print('Popular in atla but not in common english')
print(set(freq_atla).difference(freq_eng[:20]))
print('Popular in common english but not in atla')
print(set(freq_eng[:20]).difference(freq_atla))

We can observe that the differences are minor. The mismatched tokens belong to similar parts of speech. The only difference seems to be that ATLA uses singular personal forms more often, whereas ordinary English uses more plural ones. However, this comparison is quite imperfect: a minor difference, such as a word from the ATLA subset being the 21st most popular word in ordinary English, is counted as a mismatch. Thus, we look for words that are among the 20 most popular in ATLA but do not occur among the first 100 tokens of common English.

In [None]:
print(len(set(freq_atla).difference(freq_eng[:100])))
print(set(freq_atla).difference(freq_eng[:100]))

It is very surprising that so many tokens are missing from the first 100 tokens. This supports the hypothesis that many sentences in ATLA are spoken using "I", "me" or "my".

## Character similarities
In this step we want to find the most similar characters. Similarity is measured as the pairwise cosine similarity between tf-idf representations, where each character is represented by all of their statements merged into one document.
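
To make the measure concrete, here is a minimal sketch on toy data. The three "speeches" and the labels "A", "B", "C" are hypothetical and stand in for the merged statements of three characters. Documents that share no words get a cosine similarity of 0, while documents with overlapping vocabulary score higher.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical merged "speeches", one string per speaker.
docs = {
    "A": "water tribe water bending",
    "B": "water tribe boomerang",
    "C": "fire nation honor",
}

tfidf = TfidfVectorizer().fit_transform(list(docs.values()))
sim = cosine_similarity(tfidf)   # dense (3, 3) symmetric matrix
np.fill_diagonal(sim, 0)         # ignore each speaker's similarity to itself

# Locate the largest remaining entry and map it back to speaker labels.
i, j = np.unravel_index(np.argmax(sim), sim.shape)
labels = list(docs)
most_similar = {labels[i], labels[j]}
print(most_similar)
```

Zeroing the diagonal matters because every document has similarity 1 with itself, which would otherwise dominate the ranking.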

In [None]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
join = lambda x: ' '.join(sum([val for val in x.values], []))

df_temp = pd.DataFrame(df['character_words'].str.lower().str.replace(r'[^\w\s]', '', regex=True).str.split(expand=False))
df_temp.insert(0, 'character', df['character'])
df_temp = df_temp.groupby('character').agg({'character_words': join})

cos_sim = cosine_similarity(TfidfVectorizer().fit_transform(df_temp['character_words'].tolist()))
np.fill_diagonal(cos_sim, 0)

idxs = np.unravel_index(np.argsort(cos_sim.ravel())[-10:], cos_sim.shape)
idxs = np.array(idxs).T
idxs.sort(axis=1)
idxs = np.unique(idxs, axis=0)
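
Because the similarity matrix is symmetric, the 10 largest entries of the full matrix typically contain each pair twice, so the cell above ends up with about 5 unique pairs. An alternative sketch, under the assumption that exactly `k` unique pairs are wanted, ranks only the strictly upper triangle. The helper `top_k_pairs` and the toy matrix are hypothetical and not part of the original notebook.

```python
import numpy as np

def top_k_pairs(sim: np.ndarray, k: int):
    """Return the k highest-scoring (i, j, score) pairs with i < j."""
    rows, cols = np.triu_indices_from(sim, k=1)   # strictly upper triangle
    scores = sim[rows, cols]
    order = np.argsort(scores)[::-1][:k]          # positions of the k largest scores
    return [(int(rows[o]), int(cols[o]), float(scores[o])) for o in order]

# Hypothetical symmetric similarity matrix for four speakers.
toy = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.4, 0.3],
    [0.2, 0.4, 1.0, 0.8],
    [0.1, 0.3, 0.8, 1.0],
])
pairs = top_k_pairs(toy, k=2)
print(pairs)
```

This variant also returns the score of each pair and does not need the diagonal to be zeroed first, since the diagonal is excluded by `k=1`.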

In [None]:
for x, y in idxs:
    print(f'{df_temp.index[x]} and {df_temp.index[y]}')

The similarities between Aang, Katara and Sokka are easy to explain: they are the main characters and spend most of their time together, so their language is similar (here we can recall the Polish proverb about cawing like the crows when you are among them). The second very interesting finding is the similarity between Brainwasher and Joo Dee: there is a scene in which Joo Dee repeats Brainwasher's words.