This notebook needs a lot of libraries, so I'm importing most of them here and bringing in the dataset. Much of the LDA pipeline used in this notebook was originally written by Selva Prabhakaran at Machine Learning Plus (from the notebook [here](https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/)). I've adapted parts of it, but mostly it is his work. 

This notebook is still under construction. I've made it public to facilitate collaboration but will be updating regularly. 

In [None]:
import pandas as pd
import nltk, gensim, re, math
import en_core_web_sm
import matplotlib.ticker
import pyLDAvis.gensim
import gensim.corpora as corpora
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import matplotlib.colors as mcolors
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from collections import Counter
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('/kaggle/input/superheroes-nlp-dataset/superheroes_nlp_dataset.csv')

# Exploratory Data Analysis, Data Cleaning

I'm immediately making a copy of the dataframe; we'll come back to the original to do some NLP after the exploratory data analysis. There are a good number of columns here, 81 in total. Not all of them are going to be useful, and some of them will have to be split into additional columns.

In [None]:
cluster_df = df.copy()
print(cluster_df.shape)


Taking a look at these by type of variable. First 'object'. 

In [None]:
cluster_df.describe(include = 'object')

There's one column I want to dig into immediately: 'first_appearance'. Have the powers and abilities given to superheroes changed over time? Unfortunately, this column doesn't have a lot to offer. Some detailed background research could get years for each of these observations, but without that work we'd have to throw out a lot of rows to use this data. 

In [None]:
cluster_df['first_appearance'].head(n=20)

Some of this will be useful later on for the NLP effort, but for now a lot will be discarded. I may come back to "history_text" to see if there's something in common for certain creators. Name-related columns, place of birth, teams, and relatives will all get discarded. Gender, height, and weight I definitely want, but they will need to be recoded.

Moving on to the float values. There are 50 of them, all based on powers that the supes in question either have or don't. 

In [None]:
cluster_df.describe(include = 'float')

Moving on to the ints, there are only six. They are all the scores given to the individual supes on six different metrics. Perhaps add one "overall score"?

In [None]:
cluster_df.describe(include = 'int')

Before even digging into distributions, I'm doing some feature engineering to turn height and weight into usable numeric variables instead of free-text strings. 

In [None]:
cluster_df['height_clean'] = cluster_df.height.str.extract(r'(\d+)\s*cm', expand=True)
cluster_df['weight_clean'] = cluster_df.weight.str.extract(r'(\d+)\s*kg', expand=True)
cluster_df['height_clean'] = pd.to_numeric(cluster_df['height_clean'], errors='coerce')
cluster_df['weight_clean'] = pd.to_numeric(cluster_df['weight_clean'], errors='coerce')

cluster_df['height_clean'] = cluster_df['height_clean'].fillna(cluster_df['height_clean'].median())
cluster_df['weight_clean'] = cluster_df['weight_clean'].fillna(cluster_df['weight_clean'].median())
cluster_df['height_clean'] = cluster_df['height_clean'].astype('int')
cluster_df['weight_clean'] = cluster_df['weight_clean'].astype('int')
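
To see what the two regular expressions pick up, here is a minimal sketch on made-up strings. The sample values are hypothetical and the real column formats may differ. `str.extract` returns the first capture group, and anything without a `cm` or `kg` figure becomes `NaN` before the median fill.

```python
import pandas as pd

# Hypothetical raw values, for illustration only; not taken from the dataset
sample = pd.DataFrame({
    'height': ["6'2 / 188 cm", "-", "5'10 / 178 cm"],
    'weight': ["210 lb / 95 kg", "- lb / - kg", "170 lb / 77 kg"],
})

# expand=False returns a Series holding the first capture group
height = pd.to_numeric(sample['height'].str.extract(r'(\d+)\s*cm', expand=False), errors='coerce')
weight = pd.to_numeric(sample['weight'].str.extract(r'(\d+)\s*kg', expand=False), errors='coerce')

# Rows with no match are NaN and receive the column median
height = height.fillna(height.median()).astype('int')
weight = weight.fillna(weight.median()).astype('int')

print(height.tolist())
print(weight.tolist())
```

Note that median imputation puts every unparseable row at exactly the same height and weight, which can show up as a spike in later distribution plots.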

Now let's take a look at distributions, starting with the power "scores" that each supe received. It appears as though the scores were given quite generously: there is a clear skew towards the top end of the scale, especially for intelligence, power, and combat scores. Interestingly, for strength and combat, there's also a spike near the bottom of the range. This is probably reserved for characters who get by on deviousness and intelligence. 

In [None]:
sns.pairplot(cluster_df.select_dtypes(include=['int']), plot_kws={'alpha': 0.1})

Out of curiosity, I'm going to pull that thread:

In [None]:
sns.pairplot(cluster_df[(cluster_df['strength_score'] < 25) | (cluster_df['combat_score'] < 25)].select_dtypes(include=['int']), plot_kws={'alpha':0.1})

That's interesting: few supes with low combat scores had a high strength score, but the inverse was not true. Plenty of supes with a low strength score had a high combat score. As we predicted, the intelligence scores skew quite high here. There's a wider range of speed, durability, and power than I expected, but we do see that the distributions skew lower than in the original pair plot. 

There are some pesky outliers in both height and weight. Let's check those out.

In [None]:
cluster_df[(cluster_df['height_clean'] > 500) & (cluster_df['weight_clean'] < 500)]

I'm not a comic book expert but I'm fairly certain that Bruce Wayne is not supposed to be 30 ft tall. The others I'm not sure about. Part of the magic of superheroes is that they're outliers and I don't know these eight characters well enough to know if this is poor data or if the height and weight values are accurate. Because I don't know (and don't want to read up on these characters to find out), I'm going to keep the rest in and drop only the Batman (1966) row, which is clearly bad data. 


In [None]:
print(cluster_df.shape)
cluster_df = cluster_df[cluster_df['name'] != "Batman (1966)"]
print(cluster_df.shape)

Looking at some of the weight outliers, we can see that they're not really outliers, just a cluster of Hulks and other supes who sound like they would be really heavy. I'm willing to let these stay. 

In [None]:
cluster_df[cluster_df['weight_clean'] > 750]

There also appear to be quite a few supes who have values of zero for multiple "scores." Let's take a look at them.

In [None]:
cluster_df[(cluster_df['strength_score'] == 0) & (cluster_df['intelligence_score'] == 0)]

I'm going to remove all of these supes. I could try to impute a value for the scores for each of them, but I don't think calling all 106 of them "average" supes is a solid approach. 

In [None]:
print(cluster_df.shape)
cluster_df = cluster_df[~((cluster_df['strength_score'] == 0) & (cluster_df['intelligence_score'] == 0))]
print(cluster_df.shape)
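
One detail worth checking in filters like this one: the complement of `(a == 0) & (b == 0)` is `(a != 0) | (b != 0)`, not `(a != 0) & (b != 0)`. The second form also drops rows where only one of the scores is zero. A small sketch on a toy frame (hypothetical values) shows the difference.

```python
import pandas as pd

# Toy scores, for illustration only
toy = pd.DataFrame({
    'strength_score':     [0, 0, 50, 80],
    'intelligence_score': [0, 60, 0, 90],
})

both_zero = (toy['strength_score'] == 0) & (toy['intelligence_score'] == 0)

# Drops only the rows where both scores are zero
keep_not_both_zero = toy[~both_zero]

# Drops every row where either score is zero
keep_neither_zero = toy[(toy['strength_score'] != 0) & (toy['intelligence_score'] != 0)]

print(len(keep_not_both_zero))
print(len(keep_neither_zero))
```

Comparing the two `shape` printouts before and after the filter is a quick way to confirm how many rows were actually removed.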

Taking a look at the pair plot, we see numbers that begin to make a little more sense.

In [None]:
sns.pairplot(cluster_df.select_dtypes(include=['int']), plot_kws={'alpha': 0.1})

I want to turn some of the other columns in the dataframe into a more usable form: 'creator', 'alignment', and 'gender' are all given dummy variables here. I'll check the correlations later to see if I need to drop any of the dummies.

In [None]:
cluster_df = pd.get_dummies(cluster_df, columns=['creator','alignment','gender'])

I want to check out 'type_race.' I think this could be a really helpful column, but I also think that it could require some feature engineering to add real value. 

In [None]:
cluster_df['type_race'].value_counts()

In [None]:
plt.figure(figsize=(8,14))
sns.countplot(y=cluster_df['type_race'], order = cluster_df['type_race'].value_counts().index)

In [None]:
cluster_df['type_race'] = cluster_df['type_race'].fillna('Unknown')

There are some columns I don't want for this portion of my analysis. It's possible that 'teams' could add some value here, but it seems like cheating to add 'teams' to the analysis since I'm trying to find natural clusters of supes. I'm also throwing out 'overall_score' because I want to build my own metric for an overall score of a supe's power. Note that 'height' and 'weight' are thrown out here because the cleaned values now live in 'height_clean' and 'weight_clean'. 

In [None]:
bad_cols = ['name','real_name','full_name','overall_score','history_text',
            'powers_text','superpowers','alter_egos','aliases','place_of_birth',
            'first_appearance','occupation','base','teams','relatives',
            'height','weight','eye_color','hair_color', 'skin_color',
            'img']
cluster_df = cluster_df[[c for c in cluster_df.columns if c not in bad_cols]]

Next I'm handling the columns of powers. There are 50 of these bad boys, with a lot of NAs. For our purposes, I'm going to say that an NA means that a given supe does not have that power. 

In [None]:
power_cols = ['has_electrokinesis','has_energy_constructs','has_matter_manipulation', 'has_telepathy_resistance',
            'has_mind_control','has_enhanced_hearing','has_dimensional_travel', 'has_element_control','has_size_changing',
            'has_fire_resistance','has_fire_control','has_dexterity','has_reality_warping','has_illusions','has_energy_beams',
            'has_peak_human_condition','has_shapeshifting','has_jump','has_self-sustenance','has_energy_absorption',
            'has_cold_resistance','has_magic','has_telekinesis','has_toxin_and_disease_resistance','has_telepathy',
            'has_regeneration','has_immortality','has_teleportation','has_force_fields','has_energy_manipulation',
            'has_endurance','has_longevity','has_weapon-based_powers','has_energy_blasts', 'has_enhanced_senses','has_invulnerability',
            'has_stealth','has_marksmanship','has_flight', 'has_accelerated_healing', 'has_weapons_master', 'has_intelligence', 'has_reflexes',
            'has_super_speed','has_durability','has_stamina','has_agility','has_super_strength', 'has_heat_resistance',
            'has_mind_control_resistance']

cluster_df[power_cols] = cluster_df[power_cols].fillna(0)

One quick check to make sure we've handled all the NAs.

In [None]:
print(cluster_df.isnull().any().sum())

With the NAs handled, we can look at the correlations between existing columns. If any columns have too strong a correlation, we should do something about it or the values will essentially be double-counted in the analysis. 

In [None]:
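# 'type_race' is still a text column at this point, so leave it out of the correlation matrix
corrmat = cluster_df.drop(columns=['type_race']).corr()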
sns.heatmap(corrmat, vmax=0.9, square=True, center = 0, cmap = 'viridis')

In [None]:
power_corrmat = cluster_df[power_cols].corr()
sns.heatmap(power_corrmat, vmax=0.9, square=True, center = 0, cmap = 'viridis')

In [None]:
corrmat.abs().unstack().sort_values().drop_duplicates().sort_values(kind='quicksort', ascending=False).head(n=20)
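
The `unstack().drop_duplicates()` chain works because each pair appears twice in a symmetric matrix, but it also keeps one diagonal entry (a perfect 1.0) and would merge distinct pairs that happen to share the same value. An alternative sketch, on toy data with hypothetical column names, masks everything except the upper triangle before ranking.

```python
import numpy as np
import pandas as pd

# Toy data: 'b' is nearly a copy of 'a', 'c' is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

corr = toy.corr().abs()

# True strictly above the diagonal, so each pair is kept once and self-pairs are excluded
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

top_pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)
print(top_pairs.head())
```

The result is a Series indexed by `(row, column)` pairs, so the strongest relationships can be read off directly.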


Some of the relationships between powers are really interesting. It makes total sense that a supe with Fire Resistance would also have Heat Resistance. Picking and choosing from these powers could be tricky but there are two easy wins here: gender and alignment. It's true that there are some genderless and neutral supes, but we shouldn't lose much taking out these two columns since the neutral parties should still be accounted for. 

In [None]:
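# drop() returns a new DataFrame, so this cell only previews the result;
# cluster_df itself keeps 'alignment_Bad', which the hero/villain plot later relies on
cluster_df.drop(columns=['alignment_Bad', 'gender_Female'])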

With the NAs handled and our data clean, we can do some feature engineering to get a new overall power metric.

# Feature Engineering

I want to create a better metric of a supe's overall power, taking into account both the scores given in the data and the powers that supe has. The easiest thing to do would be to count everything equally and take a sum, but based on the distributions we saw in the pairplots, I don't think we can do that. To revisit, let's take a look. 

In [None]:
score_cols = ['intelligence_score', 'strength_score', 'speed_score', 'durability_score', 'power_score', 'combat_score']
sns.pairplot(cluster_df[score_cols], plot_kws={'alpha': 0.1})

In [None]:
cluster_df['overall'] = cluster_df[score_cols].sum(axis=1) / 6
sns.distplot(cluster_df['overall'], rug=True)


In [None]:
score_cols.append('overall')
sns.pairplot(cluster_df[score_cols], plot_kws={'alpha': 0.1})

I tried the naive approach first, a simple average of the six scores, to see how it works. The answer? Surprisingly well. Though none of the distributions of the individual scores are anywhere near normal, the average is pretty close. Evidently whoever created these scores, along with the writers, actually does a pretty good job of keeping the supes "balanced." There's clear bimodality here near the top of the range. This isn't entirely unexpected: there are going to be some overpowered supes. I would wager that a good number of these ultra-powerful beings are going to be villains that require the cooperation of teams like the Avengers to take them down.

To pull that thread, let's plot it. Indeed, a higher portion of those uber-powerful beings are villains. 

In [None]:
heroes = cluster_df.loc[cluster_df['alignment_Good'] == 1]
villains = cluster_df.loc[cluster_df['alignment_Bad'] == 1]

sns.distplot(heroes['overall'], color = '#3498db') #blue
sns.distplot(villains['overall'], color = '#e74c3c') #red

There's more engineering to be done though. I'm going to take a look at 'type_race'. 

In [None]:
cluster_df = pd.get_dummies(cluster_df, columns=['type_race'])

Awesome. Let's get into some clustering.

# Clustering

In [None]:
distortions = []
K = range(1,20)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(cluster_df)
    distortions.append(sum(np.min(cdist(cluster_df, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / cluster_df.shape[0])

# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
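
For reference, scikit-learn's `KMeans` also exposes `inertia_`, the within-cluster sum of squared distances, which is a common alternative y-axis for an elbow plot. The sketch below uses synthetic blobs rather than the superhero data; the three hand-picked centers, `n_init=10`, and `random_state=0` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

ks = range(1, 7)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances from each point to its closest center
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

On data like this the curve drops sharply up to the true number of groups and flattens afterwards. Real data rarely gives such a clean elbow, so the choice of k usually remains a judgment call.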

Here's the K-Means model using 8 clusters, the best choice based on the elbow plot above. There are better visualizations that could be used here, either a 3D plot or using some PCA to reduce the dimensionality. 

In [None]:
kmeans = KMeans(n_clusters=8)
kmeans.fit(cluster_df)
pred = kmeans.predict(cluster_df)
cluster_df['clust_pred'] = pred
sns.scatterplot(x=cluster_df['overall'], y=cluster_df['strength_score'], hue=cluster_df['clust_pred'])
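
As one option for the PCA idea mentioned above, the feature matrix can be standardized, projected onto two principal components, and then colored by cluster label. This sketch runs on synthetic data; the feature count, the `StandardScaler` step, and `random_state=0` are assumptions and not part of the original pipeline.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (illustrative only)
X, _ = make_blobs(n_samples=200, n_features=10, centers=4, random_state=0)

# Cluster in the full feature space
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Standardize, then project onto the first two principal components for plotting
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X_scaled)

print(coords.shape)
print(pca.explained_variance_ratio_.sum())
```

`coords[:, 0]` and `coords[:, 1]` can then be passed as `x` and `y` to `sns.scatterplot` with `hue=labels`. The explained variance ratio indicates how much of the structure the 2D view retains.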

In [None]:
X = cluster_df.copy()
X.drop('clust_pred', axis=1, inplace=True)
y = pred

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(max_depth=4, n_estimators = 2000)
rf.fit(X_train, y_train)

y_rf_pred = rf.predict(X_test)

In [None]:
importances = pd.Series(data=rf.feature_importances_,
                        index= X_train.columns)

# Sort importances
importances_sorted = importances.sort_values()
importances_sorted = importances_sorted[importances_sorted > 0.02]
# Draw a horizontal barplot of importances_sorted
importances_sorted.plot(kind='barh', color='lightgreen')
plt.title('Features Importances')
plt.show()

In [None]:
print(accuracy_score(y_test, y_rf_pred))

# NLP 

I'm going to start with some basic analysis of the entire corpus of history and power texts. I'm back to working with 'df', the dataframe I brought in at the start of the notebook. First, I bring all the text into one string for history and one string for powers. 

In [None]:
# Join every non-missing text into one long string per column
all_hist_text = " ".join(df['history_text'].dropna())
all_pow_text = " ".join(df['powers_text'].dropna())

In some pre-work, I was able to get some additional stopwords that I want to take out of each of these texts, respectively. I tokenize the text, remove the stopwords, and plot the most common terms. Nothing super surprising here, just some basic NLP.

In [None]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# Keep the word list under its own name so the imported `stopwords` module is not overwritten
base_stopwords = stopwords.words('english')

hist_stops = ['time','x','man','would','later','however','also','men','could']

pow_stops = ['able','also','even','ability','one','use','without','could', 'power']

hist_stopwords = base_stopwords + hist_stops
pow_stopwords = base_stopwords + pow_stops

tokenizer = RegexpTokenizer(r'\w+')
hist_tokens = tokenizer.tokenize(all_hist_text.lower())
pow_tokens = tokenizer.tokenize(all_pow_text.lower())

hist_tokens = [token for token in hist_tokens if token not in hist_stopwords]    
pow_tokens = [token for token in pow_tokens if token not in pow_stopwords] 

hist_freq_dist = nltk.FreqDist(hist_tokens)
hist_freq_dist.plot(25)

pow_freq_dist = nltk.FreqDist(pow_tokens)
pow_freq_dist.plot(25)
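
The same tokenize, filter, and count pattern can be sketched with only the standard library, which makes it easy to check on a small string. The sample sentence and stopword list here are made up, and `re.findall(r'\w+', ...)` stands in for `RegexpTokenizer(r'\w+')`.

```python
import re
from collections import Counter

# Made-up sample text and stopwords, for illustration only
text = "The hero saved the city. The city thanked the hero, and the hero left."
stops = {'the', 'and'}

# Lowercase, split on runs of word characters, then drop stopwords
tokens = re.findall(r'\w+', text.lower())
tokens = [t for t in tokens if t not in stops]

freq = Counter(tokens)
print(freq.most_common(2))
```

Using a `set` for the stopwords keeps the membership test fast, which matters once the token list runs into the millions.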

We can get more advanced though. The next portion of the notebook gets into topic modeling. I've structured the code here somewhat differently. This was originally written to be a part of a more versatile script, so I've left this code in function form. 

In [None]:
def preprocess(text_list):
    # Skip missing values (NaN floats) and keep only strings
    text_list = [text for text in text_list if isinstance(text, str)]
    for text in text_list:
        # "X-Men" is a common term I don't want split apart by the tokenizer
        text = re.sub("X-Men", "Xmen", text)
        # Lowercases, tokenizes, and de-accents; outputs a list of tokens
        text = simple_preprocess(text, deacc=True)
        yield text


In [None]:
def bis_n_tris(words, threshold = 75):
    bigram = gensim.models.Phrases(words, min_count=2, threshold=threshold) # higher threshold fewer phrases.
    trigram = gensim.models.Phrases(bigram[words], threshold=threshold)  
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram_mod = gensim.models.phrases.Phraser(trigram)
    return bigram_mod, trigram_mod

In [None]:
def process_words(texts, stopwords, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'], threshold=100):
    """Remove Stopwords, Form Bigrams, Trigrams and Lemmatization"""
    
    bigram_mod, trigram_mod = bis_n_tris(texts, threshold = threshold)
    
    texts = [bigram_mod[doc] for doc in texts]
    texts = [trigram_mod[bigram_mod[doc]] for doc in texts]
    texts_out = []
    nlp = en_core_web_sm.load()
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    # remove stopwords after lemmatization
    texts_out = [[word for word in simple_preprocess(str(doc)) if word not in stopwords] for doc in texts_out]    
    return texts_out

In [None]:
def create_model_and_corpus(words, num_topics):
    id2word = corpora.Dictionary(words)
    corpus = [id2word.doc2bow(text) for text in words]
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=num_topics, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=10,
                                           passes=10,
                                           alpha='symmetric',
                                           iterations=100,
                                           per_word_topics=True)
    return lda_model, corpus
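
For readers new to gensim, `Dictionary` assigns each unique token an integer id and `doc2bow` turns a document into `(token_id, count)` pairs. The plain-Python sketch below mimics that idea on toy documents. It is not gensim's implementation, and the ids it assigns need not match what gensim would produce.

```python
from collections import Counter

# Toy tokenized documents, for illustration only
docs = [['hulk', 'smash', 'hulk'], ['magic', 'smash']]

# Assign ids in order of first appearance (gensim's own id assignment may differ)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(corpus)
```

Word order is discarded in this representation, which is why the bigram and trigram step runs before the corpus is built.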

In [None]:

def format_topics_sentences(corpus, texts, ldamodel=None):
    # Collect one row per document: dominant topic, its share, and its keywords
    rows = []

    # Get main topic in each document
    for row_list in ldamodel[corpus]:
        # With per_word_topics=True the topic distribution is the first element
        row = row_list[0] if ldamodel.per_word_topics else row_list
        if not row:
            continue
        # Dominant topic = the (topic_num, prop_topic) pair with the largest share
        topic_num, prop_topic = max(row, key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    dominant_topic_df = sent_topics_df.reset_index()
    dominant_topic_df.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
    return sent_topics_df, dominant_topic_df
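
The core of `format_topics_sentences` is picking, for each document, the topic with the largest share from a list of `(topic_id, probability)` pairs. A toy version with hand-written distributions (hypothetical numbers, no model involved) looks like this:

```python
# Hypothetical per-document topic distributions, for illustration only
doc_topics = [
    [(0, 0.10), (1, 0.75), (2, 0.15)],
    [(0, 0.60), (2, 0.40)],
]

# For each document, keep the pair with the highest probability
dominant = [max(dist, key=lambda pair: pair[1]) for dist in doc_topics]
print(dominant)
```

A document whose top share is only slightly above the rest is a weak assignment, so the `Perc_Contribution` column is worth checking before reading much into the dominant topic.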

In [None]:
def doc_word_count_plot(df_dominant_topic):
    doc_lens = [len(d) for d in df_dominant_topic.Text]
    
    # Plot
    plt.figure(figsize=(16,7), dpi=160)
    plt.hist(doc_lens, bins = 1000, color='navy')
    plt.text(750, 100, "Mean   : " + str(round(np.mean(doc_lens))))
    plt.text(750,  90, "Median : " + str(round(np.median(doc_lens))))
    plt.text(750,  80, "Stdev   : " + str(round(np.std(doc_lens))))
    plt.text(750,  70, "1%ile    : " + str(round(np.quantile(doc_lens, q=0.01))))
    plt.text(750,  60, "99%ile  : " + str(round(np.quantile(doc_lens, q=0.99))))
    
    plt.gca().set(xlim=(0, 1000), ylabel='Number of Documents', xlabel='Document Word Count')
    plt.tick_params(size=16)
    plt.xticks(np.linspace(0,1000,9))
    plt.title('Distribution of Document Word Counts', fontdict=dict(size=22))
    plt.show()

In [None]:
def plot_words_by_dominant_topic(df_dominant_topic, num_topics):
    
    cols = [color for name, color in mcolors.TABLEAU_COLORS.items()] # more colors: 'mcolors.XKCD_COLORS'
    p_width = 2
    p_height = math.ceil(num_topics/p_width)
    
    fig, axes = plt.subplots(p_width,p_height,figsize=(16,14), dpi=160, sharex=True, sharey=True)
    
    for i, ax in enumerate(axes.flatten()):    
        df_dominant_topic_sub = df_dominant_topic.loc[df_dominant_topic.Dominant_Topic == i, :]
        doc_lens = [len(d) for d in df_dominant_topic_sub.Text]
        ax.hist(doc_lens, bins = 1000, color=cols[i])
        ax.tick_params(axis='y', labelcolor=cols[i], color=cols[i])
        sns.kdeplot(doc_lens, color="black", ax=ax.twinx())
        ax.set(xlim=(0, 1000), xlabel='Document Word Count')
        ax.set_ylabel('Number of Documents', color=cols[i])
        ax.set_title('Topic: '+str(i), fontdict=dict(size=16, color=cols[i]))
    
    fig.tight_layout()
    fig.subplots_adjust(top=0.90)
    plt.xticks(np.linspace(0,1000,9))
    fig.suptitle('Distribution of Document Word Counts by Dominant Topic', fontsize=22)
    plt.show()

In [None]:
def get_weights_and_counts(lda_model, words_cleaned):
    topics = lda_model.show_topics(formatted=False)
    data_flat = [w for w_list in words_cleaned for w in w_list]
    counter = Counter(data_flat)
    out = []
    for i, topic in topics:
        for word, weight in topic:
            out.append([word, i , weight, counter[word]])
    
    df = pd.DataFrame(out, columns=['word', 'topic_id', 'importance', 'word_count'])  
    return df  

In [None]:
def word_count_weight_plot(df, num_topics):
    p_width = 2
    p_height = math.ceil(num_topics/p_width)
    fig, axes = plt.subplots(p_width, p_height, figsize=(16,10), sharey=True, dpi=160)
    cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]
    wc_max = max(df['word_count'])*1.2
    importance_max = max(df['importance'])*1.2
    for i, ax in enumerate(axes.flatten()):
        ax.bar(x='word', height="word_count", data=df.loc[df.topic_id==i, :], color=cols[i], width=0.5, alpha=0.3, label='Word Count')
        ax_twin = ax.twinx()
        ax_twin.bar(x='word', height="importance", data=df.loc[df.topic_id==i, :], color=cols[i], width=0.2, label='Weights')
        ax.set_ylabel('Word Count', color=cols[i])
        ax_twin.set_ylim(0, importance_max); ax.set_ylim(0, wc_max)
        ax.set_title('Topic: ' + str(i), color=cols[i], fontsize=16)
        ax.tick_params(axis='y', left=False)
        ax.set_xticklabels(df.loc[df.topic_id==i, 'word'], rotation=30, horizontalalignment= 'right')
        ax.legend(loc='upper left'); ax_twin.legend(loc='upper right')
        l = ax.get_ylim()
        l2 = ax_twin.get_ylim()
        f = lambda x : l2[0]+(x-l[0])/(l[1]-l[0])*(l2[1]-l2[0])
        ticks = f(ax.get_yticks())
        ax_twin.yaxis.set_major_locator(matplotlib.ticker.FixedLocator(ticks))  
    fig.tight_layout(w_pad=2)    
    fig.suptitle('Word Count and Importance of Topic Keywords', fontsize=22, y=1.05)    
    plt.show()

In [None]:

def mct(texts, stopwords, num_topics):
    
    print("Warning: This will crash if you've picked too many topics for your dataset")
    texts_clean = list(preprocess(texts))
    tokens = process_words(texts_clean, stopwords = stopwords)
    lda_model, corpus = create_model_and_corpus(tokens, num_topics = num_topics)
    return lda_model, corpus, tokens

In [None]:
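def other_visualizations(lda_model, corpus, tokens):
    
    # The plotting helpers need the topic count; read it from the fitted model
    num_topics = lda_model.num_topics
    sent_df, dom_topic_df = format_topics_sentences(corpus, texts = tokens, ldamodel=lda_model)
    doc_word_count_plot(dom_topic_df)
    plot_words_by_dominant_topic(dom_topic_df, num_topics = num_topics)
    wc_df = get_weights_and_counts(lda_model, tokens)
    word_count_weight_plot(wc_df, num_topics = num_topics)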

In [None]:
def lda_pipeline(texts, stopwords, num_topics):
    print("Warning: This will crash if you've picked too many topics for your dataset")
    texts_clean = list(preprocess(texts))
    tokens = process_words(texts_clean, stopwords = stopwords)
    lda_model, corpus = create_model_and_corpus(tokens, num_topics = num_topics)
    sent_df, dom_topic_df = format_topics_sentences(corpus, texts = tokens, ldamodel=lda_model)
    doc_word_count_plot(dom_topic_df)
    plot_words_by_dominant_topic(dom_topic_df, num_topics = num_topics)
    wc_df = get_weights_and_counts(lda_model, tokens)
    word_count_weight_plot(wc_df, num_topics= num_topics)
    #Removing for DEMO
    #pyLDAvis.enable_notebook()
    #pyLDAvis.gensim.prepare(lda_model, corpus, dictionary=lda_model.id2word)
   # abs_vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary=lda_model.id2word)
   # pyLDAvis.show(abs_vis) 
    # pyLDAvis.display(abs_vis)

Putting it all together here. I create lists of the history and power data for each supe and then run my fancy functions on them to show the LDA visualizations. 

In [None]:
hist_data = df['history_text'].values.tolist()
pow_data = df['powers_text'].values.tolist()


hist_lda, hist_corpus, hist_tokens = mct(hist_data, stopwords = hist_stopwords, num_topics = 8)
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(hist_lda, hist_corpus, dictionary=hist_lda.id2word)

Eight topics may be too many- there really appear to be only four distinct topics per the PCA. That said, there are some interesting distinctions between topics that are close, e.g. Topic 7 is Tony Stark related, but Topic 6, which is right next to it, is a menagerie of wizards, anime characters, and supernatural characters.

In [None]:
pow_lda, pow_corpus, pow_tokens = mct(pow_data, stopwords = pow_stopwords, num_topics = 4)
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(pow_lda, pow_corpus, dictionary=pow_lda.id2word)

This is really interesting- there are four distinct topics in powers. One is related to the supernatural, one is combat prowess, one is superhuman strength/speed/durability, and one is energy manipulation. 

In [None]:
lda_pipeline(hist_data, hist_stopwords, num_topics = 8)

Running the individual plotting functions in separate cells shows the graphics better.

In [None]:
sent_df, dom_topic_df = format_topics_sentences(hist_corpus, texts = hist_tokens, ldamodel=hist_lda)
doc_word_count_plot(dom_topic_df)

In [None]:
plot_words_by_dominant_topic(dom_topic_df, num_topics = 8)

In [None]:
wc_df = get_weights_and_counts(hist_lda, hist_tokens)
word_count_weight_plot(wc_df, num_topics= 8)