![](https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Avatar_The_Last_Airbender_logo.svg/1200px-Avatar_The_Last_Airbender_logo.svg.png)

In [None]:
import pandas as pd
import numpy as np

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

import tensorflow as tf
import tensorflow_hub as hub
from sklearn.decomposition import PCA

from nltk.sentiment.vader import SentimentIntensityAnalyzer

from plotly.offline import init_notebook_mode
init_notebook_mode()

# <center> Avatar </center>

Avatar: The Last Airbender is an animated TV series that aired on Nickelodeon for 3 seasons, from February 2005 to July 2008.

Enough boring you with information copied from Wikipedia.

What really matters is that **Avatar: The Last Airbender** is a very cool TV series with perhaps the highest IMDb ratings of any animated show. It also has a huge fanbase.

In [None]:
data = pd.read_csv('../input/avatar-the-last-air-bender/avatar_data.csv')
series = pd.read_csv('../input/avatar-the-last-air-bender/series_names.csv')
avatar = pd.read_csv('../input/avatar-the-last-air-bender/avatar.csv', encoding = 'latin-1')

In [None]:
avatar['imdb_rating'] = avatar['imdb_rating'].fillna(9.7)

# <center> IMDB Ratings </center>

> **The series consists of sixty-one episodes. Each season of the series is known as a "book", in which each episode is referred to as a "chapter". Each book takes its name from one of the elements Aang must master: Water, Earth, and Fire.**

-- Source: Wikipedia

In [None]:
fig = px.bar(series, x = 'book', y = 'series_rating', template = 'simple_white', color_discrete_sequence=['#f18930'] * 3 ,
             opacity = 0.6, text = 'series_rating', category_orders={'book':['Water','Earth','Fire']}, 
            title = 'IMDB Rating Across Seasons')
fig.add_layout_image(
        dict(
            source="https://i.imgur.com/QWoqOZd.jpg",
            xref="x",
            yref="y",
            x=-0.5,
            y=10,
            sizex=3,
            sizey=10,
            opacity = 0.7,
            sizing="stretch",
            layer="below")
)
fig.show()

As we can see, the average IMDb rating rises with each season. This is rare and suggests that viewers enjoyed the show more and more as they kept watching.

The final season has the highest IMDb rating, and the finale has the highest IMDb rating of any episode of the show. Together, these suggest that **the show built a compelling storyline that evolved across the seasons and ended well** (which is something not a lot of shows can boast about; here's looking at you, GoT).
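As a rough cross-check of the season averages, the per-book mean can be recomputed from episode-level ratings. The sketch below uses a small made-up frame in place of `avatar_data.csv`; the column names `book` and `imdb_rating` follow the notebook, but the ratings are illustrative values, not the real data.

```python
import pandas as pd

# Toy stand-in for the episode-level data (values are illustrative only).
episodes = pd.DataFrame({
    'book': ['Water', 'Water', 'Earth', 'Earth', 'Fire', 'Fire'],
    'imdb_rating': [8.0, 8.4, 8.6, 9.0, 9.0, 9.6],
})

book_order = ['Water', 'Earth', 'Fire']

# Mean rating per book, reindexed into airing order.
season_mean = episodes.groupby('book')['imdb_rating'].mean().reindex(book_order)

# True only when every season's mean is strictly higher than the previous one.
is_upward = bool(season_mean.diff().dropna().gt(0).all())

print(season_mean)
print('Upward trend:', is_upward)
```

Run against the real `data` frame, the same two lines would confirm whether the averages in `series_names.csv` agree with the episode-level ratings.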

In [None]:
fig = px.bar(data, x = 'Unnamed: 0', y = 'imdb_rating',color = 'book',hover_name='book_chapt',template = 'simple_white',
             color_discrete_map={'Fire':'#cd0000', 'Water':'#3399ff', 'Earth':'#663307'},labels={'imdb_rating':'IMDB Rating','Unnamed: 0':'Episode'})
fig.show()

# <center>Directors</center>

In [None]:
# Number of episodes per director
director_counts = pd.DataFrame(data['director'].value_counts()).reset_index()
director_counts.columns = ['Director Name', 'Number of Episodes']

fig = make_subplots(rows=1, cols=2,specs=[[{'type':'bar'}, {'type':'pie'}]], horizontal_spacing=0.2)

# One colour per bar; the bar at position 5 is highlighted (a plain string, not a list)
directorColors = ['#adbce6'] * 7
directorColors[5] = '#ba72d4'
# Average rating per director, sorted ascending so the highest average is drawn at the top
director_rating = pd.DataFrame(data.groupby('director')['imdb_rating'].mean()).reset_index().sort_values(by = 'imdb_rating')
trace0 = go.Bar(y = director_rating['director'], x = director_rating['imdb_rating'], orientation='h',
                hovertext=director_rating['imdb_rating'],name = 'Director Average Ratings',
               marker_color = directorColors )
fig.add_trace(trace0, row = 1, col = 1)

# Share of episodes directed by each director
trace1 = go.Pie(values= director_counts['Number of Episodes'],labels = director_counts['Director Name'],name = 'Director Number of Episodes')
fig.add_trace(trace1, row = 1, col = 2)

fig.update_layout(showlegend = False, title = {'text':'Directors and Their Average Rating', 'x':0.5}, template = 'plotly_white')
fig.show()

Although Michael Dante DiMartino has the highest average IMDb rating among the directors, he directed only 1 episode, so that average rests on a single data point. On the other hand, Joaquim Dos Santos directed 8 episodes with an average IMDb rating of 9.075.
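Because an average over one episode says little, it can help to look at the mean rating and the episode count side by side. This is a minimal sketch on made-up data; the director names, the ratings, and the `min_episodes` threshold are all hypothetical.

```python
import pandas as pd

# Toy stand-in for the episode table (directors and ratings are made up).
episodes = pd.DataFrame({
    'director': ['Director A', 'Director B', 'Director B', 'Director B',
                 'Director C', 'Director C'],
    'imdb_rating': [9.5, 9.0, 9.2, 8.8, 8.5, 8.7],
})

# Mean rating and number of episodes per director in one pass.
director_summary = (
    episodes.groupby('director')['imdb_rating']
    .agg(mean_rating='mean', episodes='count')
    .sort_values('mean_rating', ascending=False)
)

# Keep only directors with at least `min_episodes` episodes (hypothetical threshold).
min_episodes = 2
reliable = director_summary[director_summary['episodes'] >= min_episodes]

print(director_summary)
print(reliable)
```

With the real `data` frame, replacing `episodes` by `data` should give the same two columns that the bar and pie charts show separately.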

In [None]:
character_dialogues = pd.DataFrame({'Character':[], 'Number of Dialogues':[],'Book' : []})
for book in ['Water', 'Earth', 'Fire']:
    temp = pd.DataFrame(avatar[avatar['book'] == book]['character'].value_counts()).reset_index()
    temp.columns = ['Character', 'Number of Dialogues']
    temp['Book'] = book
    temp = temp.sort_values(by = 'Number of Dialogues', ascending = False)
    character_dialogues = pd.concat([character_dialogues, temp])

In [None]:
important_characters = ['Aang', 'Katara', 'Zuko', 'Sokka','Toph','Iroh','Azula']

# <center> Dialogues Analysis </center>

In [None]:
bookColor = {
    'Fire':'#cd0000', 
    'Water':'#3399ff', 
    'Earth':'#663307'
}
fig = make_subplots(rows = 1, cols = 3, subplot_titles=['Water','Earth','Fire'])
for i, book in enumerate(['Water','Earth', 'Fire']):
    temp = character_dialogues[(character_dialogues['Character'] != 'Scene Description') & (character_dialogues['Book'] == book)]
    trace = go.Bar(x = temp.iloc[:10][::-1]['Number of Dialogues'].values, y = temp.iloc[:10][::-1]['Character'].values,
                   orientation = 'h', marker_color = bookColor[book], name = book,opacity=0.8)
    fig.add_trace(trace, row = 1, col = i+1)
fig.update_layout(showlegend = False, template = 'plotly_white', title = 'Characters with Most Dialogues in Each Book')
fig.show()

## Important Insights:

### Possible spoilers ahead.

- Aang had the most dialogues in the first season. This is expected, since the show is mainly about him.
- Sokka comes across as the chatterbox of the group: he has the most dialogues in the subsequent seasons. That said, Sokka also develops into a leader and strategist, which could explain why he has so many lines.
- **Azula was introduced in the second season and looked like a bigger threat to Aang, yet she was given fewer dialogues than Zuko, which hints at the role she was intended to play. She was to become the main villain, while Zuko continued on his path of self-realization and transformation into someone worthy of becoming Fire Lord.**
- Toph is also introduced in the second season as Aang's earthbending master. As she becomes part of the main gang, she clearly gets a lot of dialogues.
- **As Zuko becomes a true successor to the title of Fire Lord, he gets more dialogues. As a result, he is in 3rd place by number of dialogues in Season 3, i.e. the Fire Book.**
- **Another important thing to note is that the number of dialogues drops sharply after Season 1. In Season 1, Aang spoke as many as 818 dialogues. Sokka, who has 614 dialogues in the first season, never spoke as much in later seasons even though he tops those charts.**
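The per-book counts behind these observations can also be produced in one step with `pd.crosstab` instead of a loop. The sketch below assumes a transcript with one row per spoken line and the notebook's `book` and `character` columns; the rows themselves are made up.

```python
import pandas as pd

# Toy transcript: one row per spoken line (made-up rows).
lines = pd.DataFrame({
    'book': ['Water', 'Water', 'Water',
             'Earth', 'Earth', 'Earth',
             'Fire', 'Fire', 'Fire'],
    'character': ['Aang', 'Aang', 'Sokka',
                  'Sokka', 'Sokka', 'Toph',
                  'Zuko', 'Zuko', 'Sokka'],
})

# Rows are characters, columns are books, cells are line counts.
counts = pd.crosstab(lines['character'], lines['book'])[['Water', 'Earth', 'Fire']]

# Character with the most lines in each book.
top_speaker = counts.idxmax()

print(counts)
print(top_speaker)
```

A wide table like `counts` makes it easy to compare one character across seasons, for example `counts.loc['Sokka']`.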

In [None]:
fig = px.bar(character_dialogues[character_dialogues['Character'].isin(important_characters)],template = 'gridon',title = 'Important Characters Number of Dialogues each season',
             x = 'Number of Dialogues', y = 'Character', orientation = 'h', color='Book',barmode = 'group',
             color_discrete_map={'Fire':'#cd0000', 'Water':'#3399ff', 'Earth':'#663307'})
fig.add_layout_image(
    dict(
        source="https://vignette.wikia.nocookie.net/avatar/images/1/12/Azula.png",
        x=0.25,
        y=0.9,
    ))
fig.add_layout_image(
    dict(
        source="https://vignette.wikia.nocookie.net/avatar/images/4/46/Toph_Beifong.png",
        x=0.42,
        y=0.77,
    ))
fig.add_layout_image(
    dict(
        source="https://vignette.wikia.nocookie.net/avatar/images/c/c1/Iroh_smiling.png",
        x=0.35,
        y=0.6,
    ))
fig.add_layout_image(
    dict(
        source="https://vignette.wikia.nocookie.net/avatar/images/4/4b/Zuko.png",
        x=0.62,
        y=0.47,
    ))

fig.add_layout_image(
    dict(
        source="https://vignette.wikia.nocookie.net/avatar/images/c/cc/Sokka.png",
        x=0.85,
        y=0.32,
    ))
fig.add_layout_image(
    dict(
        source="https://static.wikia.nocookie.net/loveinterest/images/c/cb/Avatar_Last_Airbender_Book_1_Screenshot_0047.jpg",
        x=0.85,
        y=0.18,
    ))
fig.add_layout_image(
    dict(
        source="https://comicvine1.cbsistatic.com/uploads/scale_small/11138/111385676/7212562-5667359844-41703.jpg",
        x=1.05,
        y=0.052,
    ))
fig.update_layout_images(dict(
        xref="paper",
        yref="paper",
        sizex=0.09,
        sizey=0.09,
        xanchor="right",
        yanchor="bottom"
))

fig.show()

In [None]:
chapter_dialogues = pd.DataFrame({'Chapter':[], 'Number of Dialogues':[],'Book' : []})
dialogue_df = avatar[avatar['character']!='Scene Description']
for book in ['Water', 'Earth', 'Fire']:
    temp = pd.DataFrame(dialogue_df[(dialogue_df['book'] == book)]['chapter'].value_counts()).reset_index()
    temp.columns = ['Chapter', 'Number of Dialogues']
    temp['Book'] = book
    chapter_dialogues = pd.concat([chapter_dialogues, temp])
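# Sort descending so that the first 20 rows are the episodes with the most dialogues
chapter_dialogues = chapter_dialogues.sort_values(by = 'Number of Dialogues', ascending = False)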

In [None]:
colors = []
for i in range(20):
    if(chapter_dialogues.iloc[i]['Book'] == 'Fire'):
        colors.append('#cd0000')
    elif(chapter_dialogues.iloc[i]['Book'] == 'Water'):
        colors.append('#3399ff')
    else:
        colors.append('#663307')
trace = go.Bar(x = chapter_dialogues.iloc[:20]['Number of Dialogues'], y = chapter_dialogues.iloc[:20]['Chapter'], 
               orientation = 'h', marker_color = colors)
fig = go.Figure([trace])
fig.update_layout(title = {'text':'Top 20 Episodes with the Most Number of Dialogues', 'x':0.5},
                 xaxis_title="Number of Dialogues",
                 yaxis_title="Chapter Name",
                 template = 'plotly_white')
fig.show()

Of the Top 20 episodes with the most dialogues, 5 are from the first season, 8 from the second, and 7 from the third.

The Drill is the episode with the most dialogues. Here's an overview of the episode, taken from the [Avatar Wiki](https://avatar.fandom.com/wiki/The_Drill), in case you don't remember:

> Having successfully navigated the Serpent's Pass, Aang is determined to journey to Ba Sing Se in the hopes of finding his lost bison, Appa. However, he discovers a Fire Nation drill heading straight for Ba Sing Se, intent on destroying the wall. Aang and the group succeed in demolishing the drill from the inside of the mechanism. Meanwhile, Jet wishes to recruit Zuko as a member of the Freedom Fighters, only to learn Zuko and Iroh are firebenders.

In [None]:
ratings = []
for i in range(len(chapter_dialogues)):
    chapter = chapter_dialogues.iloc[i]['Chapter']
    imdb_rating = avatar[avatar['chapter'] == chapter]['imdb_rating'].mean()
    ratings.append(imdb_rating)
chapter_dialogues['IMDB Rating'] = ratings
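# Assign the result back; calling fillna(inplace=True) on a selected column may not update the frame in newer pandas
chapter_dialogues['IMDB Rating'] = chapter_dialogues['IMDB Rating'].fillna(9.7)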

In [None]:
chapter_dialogues['Dialogues Per Rating'] = chapter_dialogues['Number of Dialogues'] / chapter_dialogues['IMDB Rating']
chapter_dialogues = chapter_dialogues.sort_values(by = 'Dialogues Per Rating')

Here are the Top 20 episodes that require the fewest dialogues to gain one unit of IMDb rating.
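To make the metric concrete, here is a small worked example of the ratio on made-up numbers. The column names match the notebook; the episode names, counts and ratings are hypothetical.

```python
import pandas as pd

# Made-up episodes: line counts and ratings are illustrative only.
episodes = pd.DataFrame({
    'Chapter': ['Episode A', 'Episode B', 'Episode C'],
    'Number of Dialogues': [200, 150, 300],
    'IMDB Rating': [8.0, 9.5, 7.5],
})

# Lines of dialogue per rating point; lower means fewer lines per point.
episodes['Dialogues Per Rating'] = (
    episodes['Number of Dialogues'] / episodes['IMDB Rating']
)

# Smallest ratio first, as in the chart.
ranked = episodes.sort_values('Dialogues Per Rating').reset_index(drop=True)
print(ranked)
```

Note that the ratio is driven mostly by the dialogue count, because ratings vary over a narrow range while counts vary widely.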

In [None]:
# First 20 rows (lowest ratio), reversed so the lowest ratio is drawn at the top of the chart
top20 = chapter_dialogues.iloc[:20][::-1]
# Build the colours from the same reversed frame so each bar gets its own book's colour
book_colors = {'Fire':'#cd0000', 'Water':'#3399ff', 'Earth':'#663307'}
colors = [book_colors.get(book, '#663307') for book in top20['Book']]
trace = go.Bar(x = top20['Dialogues Per Rating'], y = top20['Chapter'],
              text = top20['IMDB Rating'], orientation = 'h', marker_color = colors, 
              textposition="outside",texttemplate='%{text:.2s}',
              textfont=dict(
              family="sans serif",
              size=18,
              color="Black")
)
fig = go.Figure([trace])
fig.update_layout(title = {'text':'Top 20 Episodes with the Least Dialogues Per Rating', 'x':0.5},
                 xaxis_title="Num of Dialogues / IMDB Rating",
                 yaxis_title="Chapter Name",
                 template = 'plotly_white')
fig.show()

## Important Insights:

- Sozin's Comet Part 4 is the finale of the show. It is heavy on fight scenes, which explains why it is at the top: there weren't many dialogues in those battle scenes.

In [None]:
fig  = px.scatter(chapter_dialogues, x = 'Number of Dialogues', y = 'IMDB Rating', trendline = 'ols', color = 'Book',
                 color_discrete_map={'Fire':'#cd0000', 'Water':'#3399ff', 'Earth':'#663307'},hover_name='Chapter' ,template = 'plotly_white',
                 title = 'Relation Between Number of Dialogues and IMDB Rating')
fig.show()

# <center> Analyzing what the Major Characters say </center>

In [None]:
stopwords = set(STOPWORDS)
def createCorpus(character_name):
    # Join all of a character's lines with spaces so words at line boundaries do not run together
    df = avatar[avatar['character'] == character_name]
    return " ".join(df['character_words'].dropna().astype(str))

def generateWordCloud(character_name, background_color):
    plt.subplots(figsize=(12,8))
    corpus = createCorpus(character_name)
    wordcloud = WordCloud(background_color=background_color,
                          contour_color='black', contour_width=4, 
                          stopwords=stopwords,
                          width=1500, margin=10,
                          height=1080
                         ).generate(corpus)
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()

## Aang

In [None]:
generateWordCloud('Aang', 'White')

Aang mentions Katara (love interest) and Sokka (love interest's brother) a lot.

He also mentions Appa more than Momo.

## Katara

In [None]:
generateWordCloud('Katara', 'LightBlue')

Katara also mentions Aang and Sokka a lot.

## Sokka

In [None]:
generateWordCloud('Sokka','Blue')

The above three wordclouds highlight the friendship between the 3 main characters of the show - Aang, Katara and Sokka. 

## Toph

In [None]:
generateWordCloud('Toph','Brown')

It is interesting to note that one of the most frequent words Toph uses is 'See', which is ironic since she is actually blind. 
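A word cloud sizes each word by how often it occurs, so an observation like this one can be checked with a plain frequency count. The sketch below uses only the standard library; the three lines of dialogue and the tiny stopword set are made up for illustration (the notebook itself relies on `wordcloud`'s `STOPWORDS`).

```python
import re
from collections import Counter

# Made-up lines standing in for one character's dialogue.
lines = [
    "I can see with my feet.",
    "You see, it's simple.",
    "See? I told you so.",
]

# Hypothetical, very small stopword list for this example only.
stopwords = {"i", "you", "it's", "my", "with", "so", "can"}

# Lowercase, split into word tokens, and drop stopwords.
tokens = re.findall(r"[a-z']+", " ".join(lines).lower())
word_counts = Counter(t for t in tokens if t not in stopwords)

print(word_counts.most_common(3))
```

On the real data, `createCorpus('Toph')` could be passed through the same tokenising step to get the exact count for any word.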

## Zuko

In [None]:
generateWordCloud('Zuko', 'Red')

Prince Zuko has had the most conversations with his Uncle Iroh.

He is also shown to be primarily concerned with capturing the Avatar, which can be seen from the wordcloud.

His desire to prove his ability to his father and to return home shows in the prominent use of the word Father.

## Iroh

In [None]:
generateWordCloud('Iroh', 'Pink')

We see mentions of the Lotus Tile, Tea and the Tea shop in Iroh's wordcloud. 

His main conversations are with Zuko, which is evident from the mentions of Zuko, Prince Zuko and Nephew in the wordcloud.

## Azula

In [None]:
generateWordCloud('Azula','Red')

Azula, like Zuko, is mainly concerned with capturing the Avatar and showing her skills to her father.

Writers spent a lot of time writing 'KNOW' ;)

In [None]:
sentInt = SentimentIntensityAnalyzer()
def get_vader_score(character_name, key = 'pos'):
    corpus = createCorpus(character_name)
    sentimentScore = sentInt.polarity_scores(corpus)
    return sentimentScore[key]


character_sent_dict = {}
for sentiment in ['pos', 'neg', 'neu']:
    char_sents = []
    for character in important_characters:
        char_sents.append(get_vader_score(character, key = sentiment))
    character_sent_dict[sentiment] = char_sents
character_sent_dict['Character Name'] = important_characters
character_sentiments = pd.DataFrame(character_sent_dict)

In [None]:
fig = px.bar(character_sentiments, x = ['pos', 'neg','neu'], y = 'Character Name',barmode='group',
             labels = {'pos':'Positive', 'neg':'Negative','neu':'Neutral', 'value':'Sentiment Score'},
             title = 'Sentiment Analysis of Characters',
             template = 'presentation')
fig.show()

* **Positive Sentiment** : Uncle Iroh has the highest Positive Sentiment Score and Sokka has the lowest Positive sentiment score.
* **Negative Sentiment** : Azula has the highest Negative Sentiment Score and Aang has the lowest Negative Sentiment Score.
* **Neutral Sentiment** : Sokka's statements are the most neutral in nature.

# <center> Finding Similar Episodes </center>

In [None]:
chapterCorpus = pd.DataFrame({'Chapter Name' : [], 'Full Text': [], 'Book' : []})
chapters = []
chapterTexts = []
books = []
for book in ['Water', 'Earth', 'Fire']:
    subBook = avatar[(avatar['book'] == book) & (avatar['character']!='Scene Description')]
    for chapter_name, df in subBook.groupby('chapter'):
        full_text = df['character_words'].values
        chapters.append(chapter_name)
        chapterTexts.append(" ".join(full_text).lower())
        books.append(book)
chapterCorpus['Chapter Name'] = chapters
chapterCorpus['Full Text'] = chapterTexts
chapterCorpus['Book'] = books

In [None]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" 
model = hub.load(module_url)

features = model(chapterCorpus['Full Text'].values)
pca = PCA(n_components=2, random_state=42)
reduced_features = pca.fit_transform(features)

chapterCorpus['Dimension 1'] = reduced_features[:,0]
chapterCorpus['Dimension 2'] = reduced_features[:,1]
fig = px.scatter(chapterCorpus, x = 'Dimension 1', y = 'Dimension 2', color = 'Book', hover_name='Chapter Name',
                color_discrete_map={'Fire':'#cd0000', 'Water':'#3399ff', 'Earth':'#663307'},
                title = 'Finding Similar Episodes',
                template = 'plotly_white')
fig.update_traces(marker=dict(size=12))
fig.show()

While we cannot see distinct clusters forming, we can see that some episodes sit close together, while others are very different from the rest.

The most important insight I can see is that the episode **Tales of Ba Sing Se** is so different from the other episodes. I remember this episode being presented very differently from the others.
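Projecting to two dimensions with PCA discards some information, so a complementary check is to compare episodes directly in the full embedding space with cosine similarity. This is a different technique from the PCA plot, shown here only as a sketch: the random matrix stands in for the sentence-encoder output, and the episode names are hypothetical.

```python
import numpy as np

# Random stand-in for the sentence-encoder output: 5 episodes x 8 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))
episode_names = ['Ep 1', 'Ep 2', 'Ep 3', 'Ep 4', 'Ep 5']  # hypothetical names

# Normalise rows so that a dot product equals cosine similarity.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = unit @ unit.T

# Ignore self-similarity when searching for the nearest neighbour.
np.fill_diagonal(similarity, -np.inf)
nearest = similarity.argmax(axis=1)

for name, j in zip(episode_names, nearest):
    print(f'{name} is closest to {episode_names[j]}')

# The episode whose best match is weakest is the most isolated one.
most_isolated = episode_names[int(similarity.max(axis=1).argmin())]
print('Most isolated:', most_isolated)
```

With the real `features` array in place of `embeddings` and `chapterCorpus['Chapter Name']` in place of `episode_names`, the "most isolated" episode gives a way to check whether an outlier in the 2-D plot is also an outlier in the original space.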

# <center>Characters Dialogues and IMDB Ratings</center>

Does the number of dialogues a character says have an impact on the rating of the episode?

In [None]:
chapterwise_dialogues = pd.DataFrame({})
for character in important_characters:
    character_df = avatar[avatar['character'] == character]
    chapter_counts = character_df.groupby('chapter').size().reset_index()
    chapter_counts.columns = ['chapter','Num of Dialogues']
    imdb_ratings = character_df.groupby('chapter')['imdb_rating'].mean().reset_index()
    dialogues_and_rating = pd.merge(chapter_counts, imdb_ratings)
    dialogues_and_rating['Character'] = character
    chapterwise_dialogues = pd.concat([chapterwise_dialogues, dialogues_and_rating])

In [None]:
fig = px.scatter(chapterwise_dialogues, 
                 x = 'chapter',y='imdb_rating', size='Num of Dialogues',
                 facet_col='Character',facet_col_wrap=2, 
                template = 'plotly_white')
fig.update_xaxes(matches = None,visible = False)
fig.show()

We do not see any such trend in the dataset. Thus, the number of lines a character speaks does not appear to affect the episode rating.
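A visual impression like this can be backed by a number. One simple option is a correlation coefficient between a character's line count and the episode rating. The sketch below uses made-up values with the notebook's column names; on the real data it would be run once per character on `chapterwise_dialogues`.

```python
import pandas as pd

# Made-up per-episode counts for one character (illustrative values only).
df = pd.DataFrame({
    'Num of Dialogues': [10, 25, 40, 15, 30, 20],
    'imdb_rating': [8.5, 8.1, 8.9, 9.0, 8.3, 8.7],
})

# Pearson correlation measures linear association.
pearson = df['Num of Dialogues'].corr(df['imdb_rating'])

# Pearson correlation of the ranks is the Spearman rank correlation,
# which measures monotonic association.
spearman = df['Num of Dialogues'].rank().corr(df['imdb_rating'].rank())

print(f'Pearson: {pearson:.3f}, Spearman: {spearman:.3f}')
```

Values close to zero would be consistent with "no trend", although with only a few dozen episodes per character such estimates are noisy, and correlation alone cannot show whether dialogue counts cause rating changes.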