# A philosopher's stone analysis.

The aim of this project is to run a simple sentiment analysis on the script of *Harry Potter and the Philosopher's Stone*. Special thanks to Erin Ward for the dataset, without which this notebook could not have been made.

In [None]:
import numpy as np
import pandas as pd

# for the plots
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
%matplotlib inline
import seaborn as sns

import string
import pickle

# for linear regressions
from scipy import stats

# for linear regressions with train and test data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# 1. Exploratory Data Analysis.

Let's open the file and read its first lines.

In [None]:
filename = '/kaggle/input/harry-potter-and-the-philosophers-stone-script/hp_script.csv'
script = pd.read_csv(filename, encoding='cp1252')
script.head()

The first column can be dropped, as pandas.DataFrame has automatically generated a column of indices.

In [None]:
script.drop(columns='ID_number', inplace=True)
script.head()

Let's check if the *dialogue* column is complete, just in case.

In [None]:
script.loc[0, 'dialogue']

As expected, it is complete. Let's check for missing values now.

In [None]:
script.isna().sum()

No missing values are found. Great!

## 1.1 Number of lines.

Since we want to apply statistical analysis to what the characters say, we first have to keep only those characters with a minimum number of lines in the movie.

Let's count how many lines each character has and store the result in a pandas Series, *lines*. For instance, we see that Harry Potter has 230 lines in the whole movie.

In [None]:
lines = script['character_name'].value_counts()
lines['Harry Potter']

We can make another DataFrame, *character*, that contains this information: each character in the movie and how many lines this character has. The first column holds the unique values of *script['character_name']* and the second is taken from the Series we've just created, *lines*. We see the top 3 are Harry Potter, Ron Weasley and Hermione Granger. Who would have guessed?

In [None]:
character = pd.DataFrame(script['character_name'].unique(), columns=['name'])
character['lines'] = character['name'].apply(lambda name: lines[name])
character.sort_values(by='lines', ascending=False, inplace=True)
character.head()

Before plotting these values, let's create a new column in the *character* dataframe: *color*. We'll just set specific colors for the three main characters.

In [None]:
character['color'] = 'grey'

character.set_index('name', inplace=True)

character.loc['Harry Potter', 'color'] = 'green'
character.loc['Ron Weasley', 'color'] = 'red'
character.loc['Hermione Granger', 'color'] = 'brown'
character.head()

It's time to plot now. It's not surprising that Harry Potter has the greatest number of lines in a movie called *Harry Potter*.

In [None]:
top10 = character.head(10)

plt.title('Top 10 number of lines of characters.')
plt.barh(top10.index, top10['lines'], color=top10['color'])
plt.gca().invert_yaxis()
plt.xlabel('Number of lines')
plt.ylabel('Character')
plt.grid()
plt.show()

Before filtering, we also have to analyse the number of words of each character. Indeed, what if a character has many lines but these lines are just single words or very short sentences? A character with fewer lines but more words is more important than a character with more lines but fewer words (at least in this analysis, as our purpose is to find sentiments in rather long sentences, not in short phrases like *Yes*, *No* or *Thanksss*).

## 1.2 Number of words.

The built-in function *len* and the string method *split* will do the job of counting how many words there are in each line of dialogue. Notice that contractions are not split: for instance, *I'm* counts as a single word.

In [None]:
script['words'] = script['dialogue'].apply(lambda x: len(x.split()))
script.head()
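
As a quick check of the counting rule, the sketch below applies the same `len(x.split())` expression to a few hand-written sentences (hypothetical examples, not lines from the dataset). Splitting on whitespace keeps contractions and trailing punctuation attached to the word.

```python
# Hypothetical sample sentences, not taken from hp_script.csv.
samples = ["I'm Harry.", "Yes", "You're a wizard, Harry."]

def count_words(sentence):
    # Same rule as the notebook cell: split on whitespace and count the pieces.
    return len(sentence.split())

counts = [count_words(s) for s in samples]
print(counts)
```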

As before, we store the total number of words per character, this time in a dataframe called *words*. We see that Harry's 230 lines add up to 1609 words in the movie.

In [None]:
words = script[['character_name', 'words']].groupby('character_name').sum()
words.loc['Harry Potter', 'words']

To add this information to the dataframe *character*, we reset the index and afterwards set *name* as the index again. This is just to make the *apply* easier.

In [None]:
character.reset_index(inplace=True)
character['words'] = character['name'].apply(lambda name: words.loc[name, 'words'])
character.set_index('name', inplace=True)
character.head()
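
An alternative that avoids resetting the index can be sketched on toy data: when both objects are indexed by character name, pandas aligns the assignment on the index labels. The frames `toy_character` and `toy_words` and their numbers are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Toy stand-ins for the notebook's `character` and `words` frames (made-up numbers).
toy_character = pd.DataFrame({'lines': [3, 2]}, index=pd.Index(['A', 'B'], name='name'))
toy_words = pd.DataFrame({'words': [7, 30]}, index=pd.Index(['B', 'A'], name='character_name'))

# Assignment aligns on index labels, so the row order in `toy_words` does not matter.
toy_character['words'] = toy_words['words']
print(toy_character)
```

Whether this reads better than the *apply* version is a matter of taste; both give the same column.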

Now we plot both graphs: number of lines and number of words. We observe that Rubeus Hagrid, who is 4th in the number of lines ranking, becomes 2nd in the number of words ranking, meaning he has fewer lines than Ron or Hermione, but longer sentences.

In [None]:
top10lines = character.sort_values(by='lines', ascending=False).head(10)
top10words = character.sort_values(by='words', ascending=False).head(10)

fig, ax = plt.subplots(1, 2, figsize=(12, 5))

ax[0].set_title('Top 10 number of lines of characters.')
ax[0].barh(top10lines.index, top10lines['lines'], color=top10lines['color'])
ax[0].invert_yaxis()
ax[0].set_xlabel('Number of lines')
ax[0].set_ylabel('Character')
ax[0].grid()

ax[1].set_title('Top 10 number of words of characters.')
ax[1].barh(top10words.index, top10words['words'], color=top10words['color'])
ax[1].invert_yaxis()
ax[1].set_xlabel('Number of words')
ax[1].set_ylabel('Character')
ax[1].grid()

fig.tight_layout()

fig.show()

We can also use the method *.describe()* to check summary statistics of the numerical columns of *character*.

In [None]:
character.describe()

This gives some valuable information. To start, there are 41 characters with at least one line (and at least one word) in the movie. On average, a character has about 19 lines and 220 words in the whole movie. The minimum of both lines and words is 1, which means that some characters have just one line of one word (like the snake: *thanksss*). We already know the maximum: Harry Potter has 230 lines and 1609 words in total.

If we plot the pairs (lines, words), we observe a pretty linear relation.

In [None]:
def line(slope, intercept, x):
    return slope * x + intercept

In [None]:
def words_vs_lines(df, scale = 'lin'):
    
    copy = df.copy()
    
    if scale == 'log':
        copy['lines'] = np.log(df['lines'])
        copy['words'] = np.log(df['words'])
    
    x = copy['lines']
    y = copy['words']
    c = copy['color']
    
    # subset for text labels
    mask = df['words'] > 400
    t = copy[mask]

    # LINEAR REGRESSION with H, R, H
    # Note: linregress returns the correlation coefficient r (not r squared).
    s1, i1, r1, p1, e1 = stats.linregress(x, y)
    if scale == 'log':
        label1=f'HRH: log(words) = {s1.round(1)} * log(lines) + {i1.round(1)}, r = {r1.round(2)}'
    else:
        label1=f'HRH: words = {s1.round(1)} * lines + {i1.round(1)}, r = {r1.round(2)}'
    
    # LINEAR REGRESSION without H, R, H
    main_characters = ['Harry Potter', 'Ron Weasley', 'Hermione Granger']
    xx = x[~x.index.isin(main_characters)]
    yy = y[~y.index.isin(main_characters)]
    s2, i2, r2, p2, e2 = stats.linregress(xx, yy)
    if scale == 'log':
        label2=f'noHRH: log(words) = {s2.round(1)} * log(lines) + {i2.round(1)}, r = {r2.round(2)}'
    else:
        label2=f'noHRH: words = {s2.round(1)} * lines + {i2.round(1)}, r = {r2.round(2)}'
    
    # FIGURE
    plt.title('Words versus lines.')
    if scale == 'log':
        plt.xlabel('Log of number of lines')
        plt.ylabel('Log of number of words')
    else:
        plt.xlabel('Number of lines')
        plt.ylabel('Number of words')

    # scatter
    plt.scatter(x, y, c='none', edgecolor=c)
    t.apply(lambda row: plt.text(row['lines'], row['words'], row.name, c=row['color']), axis=1)

    # lines
    x_array = np.array([min(x), max(x)])
    plt.plot(x_array, line(s1, i1, x_array), c='g', ls='--', lw=1, label=label1)
    plt.plot(x_array, line(s2, i2, x_array), c='k', lw=1, label=label2)

    # limits
    margin = (max(y) - min(y)) / 10
    plt.ylim([min(y) - margin, max(y) + margin])
        
    # legend
    plt.legend()
    
    plt.show()

In [None]:
words_vs_lines(character)
words_vs_lines(character, 'log')

The points for the main characters deviate from the line that the rest of the characters appear to follow, especially on the linear scale. Harry, Ron and Hermione's points lie further to the right: they have more lines, but their sentences are shorter, so their ratio of words per line is lower than that of the other characters.

Taking Harry, Ron and Hermione out of the set slightly improves the linear fit: the value reported by *linregress* (the correlation coefficient $r$, whose square is $r^2$) is 0.95, against 0.91 when the trio is included.

Using logarithms, the value with all characters improves from 0.91 to 0.92. In this case, however, discarding Harry, Ron and Hermione makes no difference.
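
Because `stats.linregress` returns the Pearson correlation coefficient $r$ rather than $r^2$, it can help to compare the two explicitly. The sketch below uses synthetic data (an assumption, not the script data) and checks that squaring `rvalue` reproduces scikit-learn's `r2_score` for the fitted line.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import r2_score

# Synthetic data: a noisy linear relation (illustrative only).
rng = np.random.default_rng(0)
x = np.arange(1, 41, dtype=float)
y = 7.0 * x + 10.0 + rng.normal(scale=20.0, size=x.size)

result = stats.linregress(x, y)
r = result.rvalue                      # Pearson correlation coefficient
r_squared = r ** 2                     # coefficient of determination of the fit
r2_sklearn = r2_score(y, result.slope * x + result.intercept)

print(round(r, 3), round(r_squared, 3), round(r2_sklearn, 3))
```

For a least-squares line with an intercept, the two quantities agree, so $r^2$ is never larger than $|r|$.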

If we had to test these linear models, we should separate the input data into two parts: train and test. We can make a function that, given a (train, test) pair, trains a model and makes a plot of predicted targets vs truth values.

In [None]:
def linreg(X_train, X_test, y_train, y_test):

    # 1. Linear Regression.
    reg = LinearRegression()
    reg.fit(X_train, y_train)

    pred = reg.predict(X_test)

    print(f'y = {reg.intercept_.round(2)}', end=' ')
    for i, c in enumerate(reg.coef_.round(2), start=1):
        print(f'+ {c} * x{i}', end=' ')
    
    r2_test = r2_score(y_test, pred)
    r2_train = r2_score(y_train, reg.predict(X_train))

    up = max(max(pred), max(y_test))
    down = min(min(pred), min(y_test))
    margin = (up - down) / 20
    
    # 2. Plot.
    plt.title('Linear regression.')
    # Both scatters use x = prediction, y = truth, matching the axis labels below.
    plt.scatter(pred, y_test, color='black', label=f'test, r2 = {round(r2_test, 2)}')
    plt.scatter(reg.predict(X_train), y_train, color='none', edgecolor='black', label=f'train, r2 = {round(r2_train, 2)}')
    plt.plot([down - margin, up + margin], [down - margin, up + margin], linewidth=1, color='blue')
    plt.xlabel('Prediction')
    plt.ylabel('Truth')
    plt.xlim([down - margin, up + margin])
    plt.ylim([down - margin, up + margin])
    plt.legend()
    plt.grid()

    plt.show()

We can apply this function to the values on a linear scale, but this results in a worse $r^2$ score: the characters with high numbers of lines and words skew the prediction.

In [None]:
X = character[['lines']]
y = character['words']

X_train, X_test, y_train, y_test = train_test_split(X, y)

linreg(X_train, X_test, y_train, y_test)

In [None]:
X_train = np.log(X_train)
X_test = np.log(X_test)
y_train = np.log(y_train)
y_test = np.log(y_test)

linreg(X_train, X_test, y_train, y_test)

The test $r^2$ depends on the test sample drawn. With only 41 points, it can vary a lot from one split to another. The train $r^2$ is the score on the selected train sample. For this data the log-log scale is preferable: the points gather nicely around the identity line, which raises both $r^2$ scores.
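
To get a feel for how much the test score moves with the split, one option is to repeat the split with several seeds. The sketch below does this on synthetic log-log data of the same size (41 points); the data and the choice of 20 seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (log lines, log words); not the script data.
rng = np.random.default_rng(42)
X = rng.uniform(0, 5, size=(41, 1))
y = 1.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=41)

scores = []
for seed in range(20):
    # A different random_state gives a different train/test partition.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    reg = LinearRegression().fit(X_tr, y_tr)
    scores.append(r2_score(y_te, reg.predict(X_te)))

print(f'test r2: mean = {np.mean(scores):.2f}, std = {np.std(scores):.2f}')
```

Passing a fixed `random_state` to `train_test_split` in the notebook would also make a single run reproducible.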



## 1.3 Words per line.

This information suggests a ratio between words and lines. A *words per line* ratio (abbreviated as *wpl*) can be helpful in this analysis.

In [None]:
character['wpl'] = character['words'] / character['lines']
character.head()

We observe that our beloved Harry, Ron and Hermione fall to positions 28th, 25th and 21st in the *words per line* ranking. This means that, although they have the three highest numbers of lines, their sentences are short: under 10 words per line.

On the other hand, Ollivander the wand-maker is #1 in this ranking: over 40 words per line.

In [None]:
top30wpl = character.sort_values(by='wpl', ascending=False).head(30)

plt.figure(figsize=(10,10))

plt.title('Top 30 number of words per line of characters.')
plt.barh(top30wpl.index, top30wpl['wpl'], color=top30wpl['color'])
plt.gca().invert_yaxis()
plt.xlabel('Words per line')
plt.ylabel('Character')
plt.grid()
plt.show()
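
The positions in such a ranking can also be computed directly with `Series.rank`. The sketch below uses a toy frame with made-up counts; the column name `wpl_rank` is introduced here for illustration.

```python
import pandas as pd

# Toy counts (made-up), indexed by character.
toy = pd.DataFrame({'lines': [10, 2, 5], 'words': [50, 80, 60]}, index=['A', 'B', 'C'])

# Vectorised ratio, then rank with 1 = most words per line.
toy['wpl'] = toy['words'] / toy['lines']
toy['wpl_rank'] = toy['wpl'].rank(ascending=False).astype(int)
print(toy.sort_values('wpl_rank'))
```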

## 1.4 Who said what?

We can also write a function that counts how many times each character has said a given word. First, let's copy the *script* dataframe and rename its columns.

In [None]:
words = script[['character_name', 'dialogue']].copy()
words.columns = ['character', 'word']
words.head()

Now remove punctuation marks and change uppercase letters to lowercase. See how contractions lose their apostrophe: *I'm* becomes *im*.

In [None]:
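# regex=True is required in recent pandas versions for the pattern to be treated as a regular expression
words['word'] = words['word'].str.replace(r'[^\w\s]', '', regex=True)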
words['word'] = words['word'].str.lower()
words.head()
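
In pandas 2.0 and later, `Series.str.replace` treats the pattern as a literal string unless `regex=True` is passed, so it is worth checking the cleaning step on a small example. The sentences below are hypothetical.

```python
import pandas as pd

# Hypothetical sentences to check the cleaning step.
toy = pd.Series(["I'm Harry!", "Yes, Professor."])

# regex=True makes the pattern a regular expression: drop everything
# that is neither a word character nor whitespace, then lowercase.
cleaned = toy.str.replace(r'[^\w\s]', '', regex=True).str.lower()
print(cleaned.tolist())
```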

We split the strings into lists of words.

In [None]:
words['word'] = words['word'].str.split()
words.head()

And make a row for each word (with its corresponding character).

In [None]:
words = words.explode('word').reset_index(drop=True)
words.head()

Now we can write the function we're looking for:

In [None]:
def say_my_name(name):
    df = words[words['word'] == name]
    df = df.groupby('character').count()
    df = df.sort_values(by='word', ascending=False)
    
    top10 = df.copy().head(10)
    
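    # Colour only the main characters that are actually in the top 10;
    # assigning with .loc to a missing label would append an empty row.
    colors = {'Harry Potter': 'green', 'Ron Weasley': 'red', 'Hermione Granger': 'brown'}
    top10['color'] = [colors.get(who, 'grey') for who in top10.index]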
    
    plt.title(f'Who said {name}?')
    plt.barh(top10.index, top10['word'], color=top10['color'])
    plt.gca().invert_yaxis()
    plt.xlabel(f'Number of times a character says {name}')
    plt.ylabel('Character')
    plt.grid()
    plt.show()

With this function, we can easily visualise how many times a character says a word, like 'Harry', 'Potter', 'Weasley', 'Magic', 'Quidditch', 'Football' and so on (the argument must be lowercase, since the words were lowercased above). As can be seen below, only Harry, Hagrid and Hermione dare to say the name of You Know Who.

In [None]:
say_my_name('voldemort')
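
The counting step inside the function can be isolated from the plotting. The sketch below reproduces it on a toy long-format table (made-up content); `count_by_character` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Toy long-format table: one row per (character, word); made-up content.
toy_words = pd.DataFrame({
    'character': ['A', 'A', 'B', 'C', 'B', 'A'],
    'word': ['magic', 'wand', 'magic', 'owl', 'magic', 'magic'],
})

def count_by_character(df, target):
    # Keep only rows for the target word, then count rows per character.
    return df.loc[df['word'] == target, 'character'].value_counts()

magic_counts = count_by_character(toy_words, 'magic')
print(magic_counts)
```

Characters who never say the word simply do not appear in the result.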

# 2. Tokenize and lemmatize.

In [None]:
from nltk import word_tokenize
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

It's not easy to work with contractions: for instance, what is *I'd*? Depending on the context, it can be *I had* or *I would*. This analysis doesn't go to that depth of language, so a simple dictionary of contractions will be enough.

In [None]:
contractions_dict = {"ain't": 'am not', "aren't": 'are not', "can't": 'cannot', "can't've": 'cannot have', "'cause": 'because', "could've": 'could have', "couldn't": 'could not', "couldn't've": 'could not have', "didn't": 'did not', "doesn't": 'does not', "don't": 'do not', "hadn't": 'had not', "hadn't've": 'had not have', "hasn't": 'has not', "haven't": 'have not', "he'd": 'he would', "he'd've": 'he would have', "he'll": 'he will', "he'll've": 'he will have', "he's": 'he is', "how'd": 'how did', "how'd'y": 'how do you', "how'll": 'how will', "how's": 'how is', "i'd": 'i would', "i'd've": 'i would have', "i'll": 'i will', "i'll've": 'i will have', "i'm": 'i am', "i've": 'i have', "isn't": 'is not', "it'd": 'it had', "it'd've": 'it would have', "it'll": 'it will', "it'll've": 'it will have', "it's": 'it is', "let's": 'let us', "ma'am": 'madam', "mayn't": 'may not', "might've": 'might have', "mightn't": 'might not', "mightn't've": 'might not have', "must've": 'must have', "mustn't": 'must not', "mustn't've": 'must not have', "needn't": 'need not', "needn't've": 'need not have', "o'clock": 'of the clock', "oughtn't": 'ought not', "oughtn't've": 'ought not have', "shan't": 'shall not', "sha'n't": 'shall not', "shan't've": 'shall not have', "she'd": 'she would', "she'd've": 'she would have', "she'll": 'she will', "she'll've": 'she will have', "she's": 'she is', "should've": 'should have', "shouldn't": 'should not', "shouldn't've": 'should not have', "so've": 'so have', "so's": 'so is', "that'd": 'that would', "that'd've": 'that would have', "that's": 'that is', "there'd": 'there had', "there'd've": 'there would have', "there's": 'there is', "they'd": 'they would', "they'd've": 'they would have', "they'll": 'they will', "they'll've": 'they will have', "they're": 'they are', "they've": 'they have', "to've": 'to have', "wasn't": 'was not', "we'd": 'we had', "we'd've": 'we would have', "we'll": 'we will', "we'll've": 'we will have', "we're": 'we are', "we've": 'we have', 
"weren't": 'were not', "what'll": 'what will', "what'll've": 'what will have', "what're": 'what are', "what's": 'what is', "what've": 'what have', "when's": 'when is', "when've": 'when have', "where'd": 'where did', "where's": 'where is', "where've": 'where have', "who'd": 'who would', "who'll": 'who will', "who'll've": 'who will have', "who's": 'who is', "who've": 'who have', "why's": 'why is', "why've": 'why have', "will've": 'will have', "won't": 'will not', "won't've": 'will not have', "would've": 'would have', "wouldn't": 'would not', "wouldn't've": 'would not have', "y'all": 'you all', "y'alls": 'you alls', "y'all'd": 'you all would', "y'all'd've": 'you all would have', "y'all're": 'you all are', "y'all've": 'you all have', "you'd": 'you had', "you'd've": 'you would have', "you'll": 'you will', "you'll've": 'you will have', "you're": 'you are', "you've": 'you have'}

We add a new column to our dataframe *script*: *tokens*.

In [None]:
script['tokens'] = script['dialogue'].str.lower()
script.head()

We make a function to replace contractions.

In [None]:
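def dict_replace(sentence):
    # Replace longer contractions first so that, for instance,
    # "can't've" is not partially consumed by "can't".
    for key in sorted(contractions_dict, key=len, reverse=True):
        sentence = sentence.replace(key, contractions_dict[key])
    return sentence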

And apply it to our dataframe. See how *i'm* has become *i am*.

In [None]:
script['tokens'] = script['tokens'].apply(dict_replace)
script.head()
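
Plain substring replacement can match inside longer words or consume part of a longer contraction. One possible refinement, sketched here with a small hypothetical subset of the dictionary, is a single regular expression that tries longer keys first and only matches whole contractions. The names `subset` and `expand` are introduced for illustration.

```python
import re

# Small illustrative subset; the full notebook dictionary would be used instead.
subset = {"can't": 'cannot', "can't've": 'cannot have', "i'm": 'i am', "he's": 'he is'}

# Longest keys first so "can't've" wins over "can't"; the lookarounds
# require that the match is not glued to other word characters.
keys = sorted(subset, key=len, reverse=True)
pattern = re.compile(r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)")

def expand(sentence):
    # Replace each matched contraction with its expansion.
    return pattern.sub(lambda m: subset[m.group(1)], sentence)

print(expand("i'm sure he can't've left"))
```

With this pattern, a word such as *she's* is left alone unless it has its own entry, instead of being rewritten through the *he's* entry.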

We apply the nltk tokenizer.

In [None]:
script['tokens'] = script['tokens'].apply(word_tokenize)
script.head()

We define a function to filter and lemmatize the tokens. Check this link for Part Of Speech tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [None]:
stop_words = stopwords.words('english')
# Create the lemmatizer once instead of once per token.
lemmatizer = WordNetLemmatizer()

def clean_tokens(tokens_list):
    cleaned_tokens_list = []
    
    # Identify Part Of Speech (POS)
    for token, tag in pos_tag(tokens_list):
        if tag == 'NN' or tag == 'NNS':
            # Noun (non proper)
            pos = 'n'
        elif tag.startswith('VB'):
            # Verb
            pos = 'v'
        elif tag.startswith('JJ'):
            # Adjective
            pos = 'a'
        else:
            continue
        
        # Lemmatize (for instance, cats -> cat, bringing -> bring, better -> good)
        token = lemmatizer.lemmatize(token, pos)
        
        # Filter out punctuation marks and stop_words
        if token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens_list.append(token.lower())
        
    return cleaned_tokens_list

And apply it.

In [None]:
script['clean_tokens'] = script['tokens'].apply(clean_tokens)
script.head()

See how a sentence like *And the boy?* becomes just *boy* after the filtering, as *and* and *the* are stop words. Also, we observe how *bringing* becomes *bring* after the lemmatizer has been applied.

# 3. Sentiment analysis with Twitter model.

We use the Twitter model developed in another notebook. See https://www.kaggle.com/mlopez13/twitter-sentiment-analysis

In [None]:
filename = '/kaggle/input/twitter-sentiment-analysis/twitter_model.sav'
# Use a context manager so the file handle is closed after loading.
with open(filename, 'rb') as f:
    model = pickle.load(f)

Given a dictionary of tokens, the model gives the probability that the tokens come from a positive message. For instance, because tweets containing the verb *kill* are typically negative, the probability that *kill* comes from a positive message is low, around 14%.

In [None]:
dist = model.prob_classify({'kill': True})
dist.prob('pos')

We prepare the tokens as dictionaries for the model.

In [None]:
script['dict'] = script.apply(lambda row: dict([token, True] for token in row['clean_tokens']), axis=1)
script.head()

And apply the model, creating a new column: *sentiment*.

In [None]:
script['sentiment'] = script.apply(lambda row: model.prob_classify(row['dict']).prob('pos'), axis=1)
script.head()

This is a very simple model that assigns a probability, between 0 and 1, to a collection of tokens. A human would struggle to say whether the sentences *Good evening, Professor Dumbledore. Are the rumours true?* are positive or negative, but the model assigns a negative value to the word *evening* because its training sample contains more negative than positive tweets with that word. See below.

Similarly, the sentence *Hagrid is bringing him.* is classified as positive because the sample contains more positive than negative tweets with the verb *bring*.

In [None]:
words = ['good', 'evening', 'professor', 'dumbledore', 'rumour', 'true', 'hagrid', 'bring']

print('Probability that a tweet with some word is positive:\n')

for word in words:
    dist = model.prob_classify({word: True})
    x = dist.prob('pos') * 100
    category = 'POS' if x > 55 else ('NEG' if x < 45 else 'neutral')
    print(f'p({word}) = {round(x, 2)} % ({category})')

In [None]:
mask = script['character_name'] == 'Draco Malfoy'
columns = ['dialogue', 'clean_tokens', 'sentiment']
script.loc[mask, columns].head(10)

We observe that the model still returns a probability when no tokens are left: the 7th line of Draco Malfoy, *No!*, consists only of a stop word, which is discarded, so the model classifies an empty dictionary. Such lines should default to a probability of 0.5.

Let's copy this dataframe and rescale *sentiment* from [0, 1] to [-1, 1] to analyse it better: *sentiment > 0* if positive, *sentiment < 0* if negative. We set *sentiment = 0* if the line has no relevant tokens (if the dictionary is empty).

In [None]:
df = script.copy()
df['sentiment'] = (2 * df['sentiment'] - 1).round(5)
df['sentiment'] = df.apply(lambda row: row['sentiment'] if len(row['dict']) > 0 else 0, axis=1)
df['sentiment_cat'] = df['sentiment'].apply(lambda x: 'POS' if x > 0.05 else ('NEG' if x < -0.05 else 'neutral'))
df.drop(columns=['tokens', 'dict'], inplace=True)
df.head()
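
The rescaling and the thresholds can be written as two small functions. This is a sketch of the same rule on plain numbers; the function names `rescale` and `categorise` and the example probabilities are introduced here for illustration.

```python
def rescale(prob_pos, n_tokens):
    # Map a probability in [0, 1] to a score in [-1, 1];
    # lines without relevant tokens get a neutral score of 0.
    if n_tokens == 0:
        return 0.0
    return round(2 * prob_pos - 1, 5)

def categorise(score, threshold=0.05):
    # Same cut-offs as the notebook: a narrow neutral band around 0.
    if score > threshold:
        return 'POS'
    if score < -threshold:
        return 'NEG'
    return 'neutral'

# (probability of 'pos', number of clean tokens) pairs, made up for the example.
examples = [(0.14, 1), (0.5, 2), (0.9, 3), (0.3, 0)]
labels = [categorise(rescale(p, n)) for p, n in examples]
print(labels)
```

A probability of 0.5 maps to a score of 0, so the default for empty lines coincides with the neutral point of the scale.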

We observe that the sentiment distribution is centered around -0.05, which is close to neutral. Its distribution is quite uniform, except for a peak of neutral lines at 0. This is because many lines in this movie contain tokens, like proper nouns (*quidditch*, *weasley*, *voldemort*, *leviosa*...), that don't exist in the Twitter model and hence are classified as neutral.

In [None]:
np.mean(df['sentiment'])

In [None]:
plt.title('Sentiment distribution.')
plt.hist(df['sentiment'], bins=11, color='b', edgecolor='k')
plt.xlabel('Sentiment')
plt.ylabel('Counts')
plt.grid()
plt.show()

We can add up all scores for each character and then divide by the total number of lines that each character has to find the average sentiment.

In [None]:
total = df[['character_name', 'sentiment']].groupby('character_name').sum()
total.loc['Harry Potter', 'sentiment']

In [None]:
character.reset_index(inplace=True)
character['sentiment'] = character.apply(lambda row: total.loc[row['name'], 'sentiment'] / lines[row['name']], axis=1)
character['sentiment_cat'] = character['sentiment'].apply(lambda x: 'POS' if x > 0.05 else ('NEG' if x < -0.05 else 'neutral'))
character.set_index('name', inplace=True)
character.head()
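
Dividing the per-character sum by the number of lines is the same as taking a per-character mean, which pandas can compute directly. The sketch below checks this on a toy frame with made-up scores.

```python
import pandas as pd

# Toy data with made-up sentiment scores.
toy = pd.DataFrame({
    'character_name': ['A', 'A', 'B', 'B', 'B'],
    'sentiment': [0.5, -0.1, 0.2, 0.0, -0.8],
})

grouped = toy.groupby('character_name')['sentiment']
by_sum = grouped.sum() / grouped.count()   # sum divided by number of lines
by_mean = grouped.mean()                   # direct average

print(by_mean)
```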

In [None]:
character.describe()

If we take the 11 characters with more than 200 words in the movie and sort them by average sentiment, we observe that *mean* characters like Petunia or Vernon have positive average sentiments, whilst *good* characters like Harry or Hagrid have negative ones.

In [None]:
mask = character['words'] > 200
columns = ['sentiment', 'sentiment_cat']
character.loc[mask, columns].sort_values(by='sentiment', ascending=False)

This seems contradictory, but it is in fact consistent with the script. Petunia and Vernon mostly appear talking to their son, Dudley, and almost all these lines are classified as positive. Let's see it:

In [None]:
mask = script['character_name'] == 'Petunia Dursley'
columns = ['dialogue', 'clean_tokens', 'sentiment', 'sentiment_cat']
df.loc[mask, columns].head(10)

The model works fine in most cases: Petunia talks very positively to Dudley, with positive tokens like *perfect*, *wonderful* or *darling*.

If we do the same with Hagrid, we observe quite the opposite: most lines are classified as negative.

In [None]:
mask = script['character_name'] == 'Rubeus Hagrid'
columns = ['dialogue', 'clean_tokens', 'sentiment', 'sentiment_cat']
df.loc[mask, columns].head(10)

In [None]:
tokens = ['wizard', 'harry']

print('Probability that a tweet with some word is positive:\n')

for token in tokens:
    dist = model.prob_classify({token: True})
    x = dist.prob('pos') * 100
    category = 'POS' if x > 55 else ('NEG' if x < 45 else 'neutral')
    print(f'p({token}) = {round(x, 2)} % ({category})')

As observed above, the model may be biased: the word *harry* is classified as negative.

# 4. Further analysis.

A model trained on different data should be used to better analyse the sentiment throughout this movie. Also, proper nouns should be excluded at the part-of-speech filtering step.