# South Park Character Predictor
Welcome to the South Park Character Predictor. This notebook uses a dataset of South Park dialogue lines to predict which character in the show is speaking each line.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import PassiveAggressiveClassifier, LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler, MaxAbsScaler, Normalizer
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, LSTM, Embedding, Dropout

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv('../input/southparklines/All-seasons.csv')
seasons = data['Season']
episodes = data['Episode']
lines = data['Line']

# Feature engineering
We start with feature engineering. The dataset we are given is shown below. It consists of over 70,000 lines from the show, along with the season and episode each line appears in and the character who speaks it.

In [None]:
data

In the raw data, every line ends with a newline character ('\n'), which starts a new line when the text is printed. The next cell strips that trailing character from each line.

In [None]:
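# Strip only the trailing newline; str(i)[:-1] would also cut the last
# character of any line that does not end in '\n'
data['Line'] = data['Line'].astype(str).str.rstrip('\n')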
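
Two common ways to drop a trailing newline are slicing off the last character and calling `rstrip('\n')`. The sketch below uses made-up strings (not taken from the dataset) to show how the two differ when an entry has no trailing newline.

```python
import pandas as pd

# Hypothetical sample: only the first entry ends with a newline.
toy = pd.Series(["Hello there!\n", "No newline here"])

# Slicing drops the last character of every entry, newline or not.
sliced = [s[:-1] for s in toy]

# str.rstrip('\n') removes trailing newlines only and leaves other text intact.
stripped = toy.str.rstrip("\n").tolist()

print(sliced)
print(stripped)
```

If every line in the file really does end in a newline, both approaches give the same result; stripping is simply the more defensive choice.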

Next, we define X to be the 'Line' feature and y to be the 'Character' column.

In [None]:
X = data['Line']
y = data['Character']

In [None]:
count = Counter(y)
names = []
X = []
y = []

Now we drop every character who has spoken 1,000 lines or fewer. This removes the minor characters in the series, so the model focuses only on the main characters.

In [None]:
# Keep the names of characters with more than 1,000 lines
names = [name for name, n in count.items() if n > 1000]

In [None]:
# Keep only the lines spoken by the selected characters
mask = data['Character'].isin(names)
X = data.loc[mask, 'Line'].tolist()
y = data.loc[mask, 'Character'].tolist()
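
The filtering step can be illustrated on a small example. This is a minimal sketch with made-up data; `toy_characters`, `toy_lines` and the lowered `threshold` are assumptions for illustration, standing in for the real columns and the 1,000-line cutoff.

```python
from collections import Counter

# Hypothetical toy data standing in for data['Character'] and data['Line'].
toy_characters = ["Cartman", "Stan", "Cartman", "Kyle", "Cartman", "Stan"]
toy_lines = ["a", "b", "c", "d", "e", "f"]
threshold = 1  # the notebook uses 1000; lowered here for the toy data

toy_count = Counter(toy_characters)
kept_names = {name for name, n in toy_count.items() if n > threshold}

# Keep only (line, character) pairs whose character passes the threshold.
kept = [(line, who) for line, who in zip(toy_lines, toy_characters) if who in kept_names]
toy_X = [line for line, _ in kept]
toy_y = [who for _, who in kept]

print(sorted(kept_names))
print(toy_X)
print(toy_y)
```

Storing the kept names in a set makes each membership check constant-time, which matters more as the number of rows grows.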

In [None]:
X[:10]

In [None]:
y[:10]

As seen above, the target (y) is categorical, so we need to convert it into numbers. This is done with the LabelEncoder class, which assigns an integer to every unique value in the array.

In [None]:
target = y
le = LabelEncoder()
y = le.fit_transform(y)
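
As a small illustration of what the encoder does, the sketch below encodes a made-up list of names and maps the integers back again. The `toy_labels` list is hypothetical; only `LabelEncoder` itself comes from the notebook.

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["Stan", "Cartman", "Kyle", "Cartman"]  # made-up sample

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(toy_labels)

# classes_ holds the unique labels in sorted order; each label's index is its code.
print(list(toy_le.classes_))
print(encoded.tolist())

# inverse_transform maps the integer codes back to the original names.
decoded = toy_le.inverse_transform(encoded).tolist()
print(decoded)
```

Keeping the fitted encoder around is what makes it possible to turn model predictions back into character names later.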

An essential step is splitting X and y into train and test sets, which we do with the train_test_split function. This splits our data into 80% train (used to train the model) and 20% test (used to evaluate how well the model has done).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
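
Because some characters have far more lines than others, one option worth considering is a stratified split, which keeps the class proportions the same in both sets. This is a sketch on made-up data, not what the notebook does; the `stratify` argument is an optional parameter of `train_test_split`.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 examples of class 0 and 20 of class 1.
X_toy = [f"line {i}" for i in range(100)]
y_toy = [0] * 80 + [1] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

# Sizes of the two sets and the class balance inside the test set.
print(len(X_tr), len(X_te))
print(Counter(y_te))
```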

Since the dialogue lines in our X data are text, we must convert them into numbers. This is done with a bag-of-words representation followed by TF-IDF weighting.

In [None]:
cv = CountVectorizer()
tfidf = TfidfTransformer()
normalizer = Normalizer()

Xtr_bow = cv.fit_transform(X_train)
Xte_bow = cv.transform(X_test)

X_train = tfidf.fit_transform(Xtr_bow)
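# Reuse the IDF weights learned on the training set; refitting on the test
# set would weight the test features differently from the training features
X_test = tfidf.transform(Xte_bow)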

X_train = normalizer.fit_transform(X_train)
X_test = normalizer.transform(X_test)
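
The same vectorisation can be expressed as a scikit-learn `Pipeline`, which makes it harder to fit a step on the test data by accident. This is a minimal sketch on made-up sentences; the step names `bow` and `tfidf` are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Made-up documents, for illustration only.
train_docs = ["good morning everyone", "good night", "see you in the morning"]
test_docs = ["good evening", "see you tonight"]

features = Pipeline([("bow", CountVectorizer()), ("tfidf", TfidfTransformer())])

# fit_transform learns the vocabulary and IDF weights from the training docs;
# transform reuses them, so the test matrix has the same columns.
Xtr = features.fit_transform(train_docs)
Xte = features.transform(test_docs)
print(Xtr.shape, Xte.shape)

# TfidfTransformer uses norm='l2' by default, so each row already has unit length.
row_norms = np.sqrt(np.asarray(Xtr.multiply(Xtr).sum(axis=1)).ravel())
print(row_norms)
```

Since the rows come out of `TfidfTransformer` with unit L2 norm under its default settings, a further `Normalizer()` step with its default L2 norm should leave them unchanged.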

# Data visualisation
Next, we will visualise the data.

First, we use a word cloud to see which words are the most common. As we can see below, the most frequently used words are 'know', 'now', 'right', 'well' and 'oh'.

In [None]:
wordcloud = WordCloud(background_color='white').generate(' '.join(lines))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

Afterwards, we will examine three bar graphs which show how the amount of dialogue varies by character, season and episode number.
* The first graph shows that the character who speaks the most is Cartman, with almost 10,000 lines. Stan and Kyle follow with over 7,000 each, then Butters, Randy and Mr Garrison.
* The second graph shows that the number of lines dropped as the seasons went by, with the decline beginning at season 7. Seasons 2-7 all had more than 4,000 lines, reaching a maximum of over 6,000, whereas after season 7 the average was around 3,000.
* The third graph shows that, summed across seasons, episode numbers 1-14 each had roughly 4,000-5,000 lines, whereas episode numbers 15-18 each had fewer than 2,000.

In [None]:
#lines per character
countY = Counter(target)
plt.bar(countY.keys(), countY.values(), color='blue')
plt.title('Distribution of character lines')
plt.ylabel('Number of lines')
plt.xlabel('Character')
plt.show()

#lines per season
countS = Counter(seasons)
del countS['Season']

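# Move the first nine entries to the end so the bars appear in season order
# (this relies on the order in which the seasons occur in the CSV)
vals = list(countS.values())[9:]+list(countS.values())[:9]
keys = list(countS.keys())[9:]+list(countS.keys())[:9]
countS = dict(zip(keys, vals))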

plt.bar(countS.keys(), countS.values(), color='red')
plt.title('Distribution of lines per season')
plt.ylabel('Number of lines')
plt.xlabel('Season')
plt.show()

#lines per episode
countE = Counter(episodes)
del countE['Episode']

plt.bar(countE.keys(), countE.values(), color='green')
plt.title('Distribution of lines per episode')
plt.ylabel('Number of lines')
plt.xlabel('Episode in season')
plt.show()
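
The per-season and per-episode counts can also be put in numeric order with pandas, rather than by rearranging dictionary entries by position. This is a minimal sketch on a made-up column, assuming the season values are stored as strings and that a stray header value such as 'Season' may appear among them (as the `del countS['Season']` call suggests).

```python
import pandas as pd

# Made-up column mimicking the CSV: string season numbers plus a stray header value.
toy_seasons = pd.Series(["10", "10", "2", "Season", "1", "2", "10"])

# Non-numeric entries become NaN and are dropped; the rest are cast to int.
numeric = pd.to_numeric(toy_seasons, errors="coerce").dropna().astype(int)

# Count lines per season and order the result by season number.
per_season = numeric.value_counts().sort_index()
print(per_season.to_dict())
```

The resulting Series can be passed to `plt.bar(per_season.index, per_season.values)` in the same way as the dictionaries above.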

# Classification
The final part of this notebook builds a model which predicts which character is speaking. Here, we test three different classifiers: Naive Bayes, Linear SVC and Passive Aggressive.

We loop over these classifiers and fit each one to the training data. We then evaluate them using the accuracy score, the model's own score method (which for classifiers also returns accuracy, so the two values match) and a cross-validation score computed on the test split. As we can see below, the Linear SVC does the best, reaching scores of 45%, while the others perform at around 41%.

In [None]:
models = [MultinomialNB(), LinearSVC(), PassiveAggressiveClassifier()]

for model in models:
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    score = model.score(X_test, y_test)
    cross_val = cross_val_score(model, X_test, y_test).mean()

    print(model, 'accuracy:', accuracy, ' score:', score, ' cross_val:', cross_val)
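
For scikit-learn classifiers, `score` returns the mean accuracy of the predictions, so it reports the same number as `accuracy_score`. The sketch below checks this on made-up sentences with arbitrary labels "A" and "B"; none of the data comes from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

# Made-up sentences with arbitrary labels, for illustration only.
toy_docs = [
    "the cat sat on the mat", "a cat chased the mouse", "cats like warm milk",
    "the dog barked at night", "a dog fetched the ball", "dogs like long walks",
]
toy_targets = ["A", "A", "A", "B", "B", "B"]

toy_features = TfidfVectorizer().fit_transform(toy_docs)
clf = LinearSVC().fit(toy_features, toy_targets)
toy_pred = clf.predict(toy_features)

# Both calls compute the fraction of correctly predicted labels.
acc = accuracy_score(toy_targets, toy_pred)
clf_score = clf.score(toy_features, toy_targets)
print(acc == clf_score)
```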

After tuning with a grid search (GridSearchCV), the scores of the Linear SVC are shown below: the accuracy and model scores are both 45%, and the cross-validation score is 41%.

In [None]:
model = LinearSVC(C=0.4, loss='squared_hinge', penalty='l2', tol=0.1, multi_class='ovr')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
score = model.score(X_test, y_test)
cross_val = cross_val_score(model, X_test, y_test).mean()

print('accuracy:', accuracy, ' score:', score, ' cross_val:', cross_val)
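
The grid search itself is not shown in this notebook. A search of this kind could look roughly like the sketch below, which runs on made-up sentences. The parameter grid, the pipeline step names and the `cv=3` setting are all assumptions for illustration, not the values actually used to arrive at the parameters above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up sentences with two arbitrary classes, for illustration only.
toy_docs = [
    "the cat sat on the mat", "a cat chased the mouse", "cats like warm milk",
    "my cat sleeps all day", "the kitten and the cat", "cat food is on the shelf",
    "the dog barked at night", "a dog fetched the ball", "dogs like long walks",
    "my dog sleeps all day", "the puppy and the dog", "dog food is on the shelf",
]
toy_targets = ["cat"] * 6 + ["dog"] * 6

# Putting the vectoriser inside the pipeline refits it on each training fold.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svc", LinearSVC())])

# Hypothetical grid; the "svc__" prefix routes each parameter to the LinearSVC step.
param_grid = {"svc__C": [0.1, 0.4, 1.0], "svc__tol": [1e-4, 1e-1]}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(toy_docs, toy_targets)

print(search.best_params_)
print(search.best_score_)
```

Running the search on the training split only, and keeping the test split for a single final evaluation, gives a less optimistic estimate of how the tuned model generalises.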

## Thank you for reading my notebook.

## If you enjoyed this notebook and found it useful, please upvote it as it will help me make more of these.