Goal: Predict ATLA (Avatar: The Last Airbender) episode ratings

Why some episodes might have better ratings:
- They feature fan favorites (e.g., Iroh) => proportion of word counts per character
- Different directors may have different average ratings
- Episode plots: emotion, action, ...
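
A minimal sketch of how the director hypothesis could be checked. The `toy` frame is a made-up stand-in for `df_avatar` (directors `A` and `B` and all ratings are hypothetical); only the column names come from the real data.

```python
import pandas as pd

# Toy stand-in for df_avatar: several dialogue rows per episode (made-up values).
toy = pd.DataFrame({
    'book_num':    [1, 1, 1, 1, 1, 1],
    'chapter_num': [1, 1, 2, 2, 3, 3],
    'director':    ['A', 'A', 'B', 'B', 'A', 'A'],
    'imdb_rating': [8.0, 8.0, 9.0, 9.0, 8.4, 8.4],
})

# Collapse to one row per episode so episodes with more dialogue
# rows do not get extra weight in the average.
episodes = toy.drop_duplicates(subset=['book_num', 'chapter_num'])

# Mean rating and number of episodes per director.
director_stats = (
    episodes.groupby('director')['imdb_rating']
    .agg(['mean', 'count'])
    .sort_values('mean', ascending=False)
)
print(director_stats)
```

Keeping the episode count next to the mean helps spot directors whose average rests on only one or two episodes.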

Feature Engineering:
- Character First Appearance
- Favorite Character Count
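
One possible way to build these two features, sketched on a toy frame. The `favorites` list, the toy rows, and the output column names `first_appearances` and `favorite_lines` are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the dialogue rows of df_avatar (made-up rows).
toy = pd.DataFrame({
    'book_num':    [1, 1, 1, 1, 2, 2],
    'chapter_num': [1, 1, 2, 2, 1, 1],
    'character':   ['Aang', 'Katara', 'Aang', 'Iroh', 'Iroh', 'Toph'],
})
favorites = ['Iroh', 'Toph']  # hypothetical list of fan favorites

episode_keys = ['book_num', 'chapter_num']

# Character first appearance: sort chronologically, then keep each
# character's earliest row and count debuts per episode.
debuts = toy.sort_values(episode_keys).drop_duplicates(subset='character')
first_appearances = debuts.groupby(episode_keys).size().rename('first_appearances')

# Favorite character count: number of lines spoken by favorites per episode.
favorite_lines = (
    toy.assign(is_favorite=toy['character'].isin(favorites))
    .groupby(episode_keys)['is_favorite']
    .sum()
    .rename('favorite_lines')
)

# One row per episode; episodes without debuts get 0.
features = pd.concat([first_appearances, favorite_lines], axis=1).fillna(0).astype(int)
print(features)
```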

Investigation:
- Which characters are positively correlated with higher episode ratings? Dialogue volume vs. rating?
- Proportion of dialogue per episode from main characters vs. secondary characters vs. characters who appear only once
- Can we perform dimensionality reduction to replace each character with its archetype?
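
The main / secondary / once split could be computed roughly as follows. This is a sketch on toy rows: the tier rule (a non-main character seen in exactly one episode counts as "once") is an assumption, and `n_words` and `tier` are hypothetical column names.

```python
import pandas as pd

# Toy stand-in for the dialogue rows of df_avatar (made-up lines).
toy = pd.DataFrame({
    'book_num':    [1, 1, 1, 1, 1, 1],
    'chapter_num': [1, 1, 1, 2, 2, 2],
    'character':   ['Aang', 'Zhao', 'Guard', 'Katara', 'Zhao', 'Aang'],
    'character_words': ['Hello there friend', 'Find the Avatar', 'Yes sir',
                        'We need to go now', 'Attack', 'Run'],
})
main_characters = ['Aang', 'Katara', 'Sokka', 'Zuko', 'Toph']
episode_keys = ['book_num', 'chapter_num']

# Whitespace-separated word count per line (missing text counts as 0 words).
words = toy.assign(n_words=toy['character_words'].fillna('').str.split().str.len())

# Number of distinct episodes each character speaks in.
episodes_per_character = (
    words.drop_duplicates(episode_keys + ['character']).groupby('character').size()
)

def tier(name):
    """Assumed rule: main cast, else 'once' if seen in a single episode, else 'secondary'."""
    if name in main_characters:
        return 'main'
    return 'once' if episodes_per_character[name] == 1 else 'secondary'

words['tier'] = words['character'].map(tier)

# Words per tier per episode, then each tier's share of the episode total.
tier_words = words.pivot_table(index=episode_keys, columns='tier',
                               values='n_words', aggfunc='sum', fill_value=0)
tier_share = tier_words.div(tier_words.sum(axis=1), axis=0)
print(tier_share)
```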

### 1. Loading the data

In [63]:
import pandas as pd


In [64]:
df_avatar = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-08-11/avatar.csv')
df_scene = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-08-11/scene_description.csv')

In [65]:
df_avatar.head()

Unnamed: 0,id,book,book_num,chapter,chapter_num,character,full_text,character_words,writer,director,imdb_rating
0,1,Water,1,The Boy in the Iceberg,1,Katara,Water. Earth. Fire. Air. My grandmother used t...,Water. Earth. Fire. Air. My grandmother used t...,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
1,2,Water,1,The Boy in the Iceberg,1,Scene Description,"As the title card fades, the scene opens onto ...",,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
2,3,Water,1,The Boy in the Iceberg,1,Sokka,It's not getting away from me this time. [Clos...,It's not getting away from me this time. Watc...,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
3,4,Water,1,The Boy in the Iceberg,1,Scene Description,"The shot pans quickly from the boy to Katara, ...",,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
4,5,Water,1,The Boy in the Iceberg,1,Katara,"[Happily surprised.] Sokka, look!","Sokka, look!","‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1


In [66]:
df_avatar.columns

Index(['id', 'book', 'book_num', 'chapter', 'chapter_num', 'character',
       'full_text', 'character_words', 'writer', 'director', 'imdb_rating'],
      dtype='str')

In [67]:
df_scene.head()

Unnamed: 0,id,scene_description
0,3,[Close-up of the boy as he grins confidently o...
1,5,[Happily surprised.]
2,6,[Close-up of Sokka; whispering.]
3,6,[A look of bliss adorns his face. He licks his...
4,8,[Struggling with the water that passes right i...


### 2.1. Soft EDA

In [68]:
# --- popular characters by book number (ranked by number of lines)
df_popular = (
    df_avatar[~df_avatar['character'].isin(['Scene Description'])]
    .groupby(['book_num', 'character'])
    .size()
    .reset_index(name='counts')
    .sort_values(by=['book_num', 'counts'], ascending=False)
)
# Show characters with more than 15 lines, one table per book
for book in (1, 2, 3):
    print(df_popular.loc[(df_popular['book_num'] == book) & (df_popular['counts'] > 15)])

     book_num       character  counts
0           1            Aang     818
57          1          Katara     636
103         1           Sokka     614
130         1            Zuko     174
51          1            Iroh     129
129         1            Zhao     107
54          1             Jet      74
126         1             Yue      51
11          1            Bumi      45
69          1       Mechanist      35
53          1     Jeong Jeong      30
82          1           Pakku      29
113         1             Teo      29
8           1            Bato      28
124         1              Wu      28
128         1    Zhang leader      28
101         1            Shyu      25
110         1            Suki      25
48          1            Haru      23
36          1  Gan Jin leader      22
5           1          Arnook      18
44          1          Gyatso      18
55          1            June      18
70          1            Meng      18
14          1    Canyon guide      17
33          

#### 2.1.1. Which character is positively correlated with higher episode rating

In [69]:


# Keep spoken lines only (drop scene-description rows)
dialogue = df_avatar[~df_avatar['character'].isin(['Scene Description'])]

# Dialogue volume per character per episode.
# NOTE: despite the column name, this is the string length (number of
# characters) of the spoken text, not a word count.
episode_character_stats = (
    dialogue.assign(character_words=dialogue['character_words'].str.len())
    .groupby(['book_num', 'chapter_num', 'character'])['character_words']
    .sum()
    .reset_index()
)

main_characters = ['Aang', 'Katara', 'Sokka', 'Zuko', 'Toph']
# Secondary characters: non-main characters with a volume above 15 in at least one episode
min_word_mask = (episode_character_stats['character_words'] > 15)
main_character_mask = (episode_character_stats['character'].isin(main_characters))
secondary_characters = episode_character_stats[min_word_mask & ~main_character_mask]['character'].unique()

# One row per episode, one column per character, 0 where a character does not speak
character_mask = episode_character_stats['character'].isin(main_characters + list(secondary_characters))
episode_pivot = episode_character_stats[character_mask].pivot_table(
    index=['book_num', 'chapter_num'],
    columns='character',
    values='character_words',
    fill_value=0
)
# TODO: normalize

# Attach the episode rating
episode_pivot = episode_pivot.merge(df_avatar[['book_num', 'chapter_num', 'imdb_rating']].drop_duplicates(),
                                    on=['book_num', 'chapter_num'])
# Rank characters only: leave out the episode identifiers and the rating itself
correlations = (episode_pivot.drop(columns=['book_num', 'chapter_num'])
                .corr()['imdb_rating']
                .drop('imdb_rating')
                .sort_values(ascending=False))
top_characters = correlations.head(9)  # top positively correlated characters
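
The `# TODO: normalize` note could be addressed by converting each episode's per-character volumes into shares of the episode total, so that long and short episodes are comparable. The sketch below uses a toy matrix; `volumes`, `ratings`, and all numbers are made up and only mimic the shape of `episode_pivot`.

```python
import pandas as pd

# Toy stand-in for the character columns of episode_pivot (made-up numbers).
index = pd.MultiIndex.from_tuples([(1, 1), (1, 2), (1, 3)],
                                  names=['book_num', 'chapter_num'])
volumes = pd.DataFrame(
    {'Aang': [300, 100, 200], 'Zuko': [100, 100, 0], 'Iroh': [0, 200, 0]},
    index=index,
)
ratings = pd.Series([8.0, 9.0, 8.5], index=index, name='imdb_rating')

# Row-normalize: each character's share of the episode's total dialogue.
shares = volumes.div(volumes.sum(axis=1), axis=0)

# Correlate each character's share with the episode rating.
share_corr = shares.corrwith(ratings).sort_values(ascending=False)
print(shares)
print(share_corr)
```

With only a few dozen episodes and many sparse character columns, such correlations are noisy, so they are better read as hints for feature selection than as findings.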

In [70]:
episode_pivot

Unnamed: 0,book_num,chapter_num,Aang,Aang:,Actor Bumi,Actor Iroh,Actor Jet,Actor Ozai,Actor Sokka,Actor Toph,...,Younger guest,Yu,Yue,Yugoda,Yung,Zei,Zhang leader,Zhao,Zuko,imdb_rating
0,1,1,2151.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,838.0,8.1
1,1,2,1525.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,658.0,8.3
2,1,3,2435.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1889.0,1262.0,8.5
3,1,4,2287.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,640.0,8.2
4,1,5,2818.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56,3,17,767.0,0.0,75.0,243.0,108.0,419.0,784.0,321.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1032.0,8.6
57,3,18,1752.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2900.0,9.1
58,3,19,1972.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1601.0,9.5
59,3,20,131.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,429.0,9.8


In [None]:
from sklearn.decomposition import PCA
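
A possible next step for the archetype question, sketched under assumptions: characters whose dialogue rises and falls together across episodes should load on the same principal component. The toy matrix is synthetic (two hidden "hero" and "villain" factors), and standardizing with `StandardScaler` before PCA is a choice made for this sketch, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_episodes = 20

# Synthetic episodes x characters matrix driven by two hidden factors.
hero = rng.random(n_episodes)
villain = rng.random(n_episodes)
toy_volumes = pd.DataFrame({
    'Aang':   hero + 0.05 * rng.random(n_episodes),
    'Katara': hero + 0.05 * rng.random(n_episodes),
    'Zuko':   villain + 0.05 * rng.random(n_episodes),
    'Zhao':   villain + 0.05 * rng.random(n_episodes),
})

# Standardize columns so high-volume characters do not dominate.
X = StandardScaler().fit_transform(toy_volumes)

# Project episodes onto two components.
pca = PCA(n_components=2)
episode_components = pca.fit_transform(X)

# Loadings: how strongly each character contributes to each component.
loadings = pd.DataFrame(pca.components_.T, index=toy_volumes.columns,
                        columns=['PC1', 'PC2'])
print(pca.explained_variance_ratio_)
print(loadings)
```

Characters with similar loadings are candidates for a shared archetype; `episode_components` could then replace the individual character columns as model features.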