### <span style="color:red">Introduction</span>
#### A glance at the data shows that it supports various studies into the popularity of YouTube videos (or lack thereof), and in particular into which types of videos tend to be more contentious. Let's look at the ESPN data as an example:

In [None]:
import numpy as np 
import pandas as pd 

espn = pd.read_csv("../input/youtube-video-statistics/ESPN.csv")
espn.head()

### <span style="color:red">Assembling a table for analysis</span>
#### One way to manage the complexity of so many videos is to cluster them into several topics. To do so, we first need to vectorize the "title" column:

In [None]:
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

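class LemmaTokenizer:
    """Lemmatize the tokens that start with a run of ASCII letters."""
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        # re.match only anchors at the start, so a token such as "n't" also passes;
        # re.fullmatch(r'[A-Za-z]+', t) would keep purely alphabetic tokens only
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc) if re.match(r'(?u)\b[A-Za-z]+\b', t)]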

from sklearn.feature_extraction.text import TfidfVectorizer

import warnings
warnings.filterwarnings("ignore") # Turning off the warnings

vectorizer = TfidfVectorizer(lowercase = False, tokenizer=LemmaTokenizer(), stop_words='english', min_df=0.001, max_df=0.99)
corpus = vectorizer.fit_transform(espn['title'])
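
#### To make the vectorization step less of a black box, here is a minimal pure-Python sketch of TF-IDF on three made-up titles. The whitespace tokenizer, the `tfidf` helper and the toy titles are assumptions for illustration only; the smoothed IDF and the L2 normalization follow what I understand to be the scikit-learn defaults, without the lemmatization, stop-word removal or `min_df` / `max_df` filtering used above.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    tokenised = [doc.split() for doc in docs]  # naive whitespace tokenizer (assumption)
    vocab = sorted({t for doc in tokenised for t in doc})
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(t for doc in tokenised for t in set(doc))
    # smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

toy_titles = ["Lakers beat Celtics", "Celtics trade rumors", "Lakers trade news"]
vocab, rows = tfidf(toy_titles)
for title, row in zip(toy_titles, rows):
    print(title, [round(v, 2) for v in row])
```

#### Terms that occur in a single title (such as "beat") end up with a larger weight than terms shared by several titles (such as "Lakers"), which is what lets the factorization below tell topics apart.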

#### Now that we have the "corpus" (in the form of a sparse matrix), we can dig out topics using a suitable matrix factorization.    

#### The classic method here would be SVD (Singular Value Decomposition). SVD is an exact factorization, and it produces a byproduct, the matrix of singular values (sigma), which is useful in some applications. There is also NMF (Non-negative Matrix Factorization), which offers neither of these perks but is otherwise fast and straightforward. In particular, NMF, unlike SVD, does not produce negative values, which tend to be tricky to interpret (for more on how NMF compares to SVD, see [here](https://medium.com/@nixalo/comp-linalg-l2-topic-modeling-with-nmf-svd-78c94330d45f) and [here](https://discuss.analyticsvidhya.com/t/how-is-svd-different-from-other-matrix-factorization-techniques-like-non-negative-matrix-factorization/67519/4)). Long story short, let's choose NMF over SVD to produce 20 topics (which sounds like enough for the sports category):

In [None]:
from sklearn.decomposition import NMF 
model = NMF(n_components=20, init='random', random_state=0)
corpus_by_topics = model.fit_transform(corpus) 
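
#### The trade-off between SVD and NMF can be seen on a small random matrix. The sketch below is only an illustration: the matrix, the rank `k` and the iteration count are made up, and the NMF part uses plain multiplicative updates written in NumPy, which is not necessarily the solver scikit-learn runs internally.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 5))  # toy non-negative "document x term" matrix

# SVD: exact reconstruction, but the factors contain negative entries
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
svd_has_negatives = bool((U < 0).any())
svd_error = np.linalg.norm(X - (U * sigma) @ Vt)

# NMF with multiplicative updates: approximate, but W and H stay non-negative
k, eps = 2, 1e-9
W = rng.random((6, k))
H = rng.random((k, 5))
initial_error = np.linalg.norm(X - W @ H)
for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)
nmf_error = np.linalg.norm(X - W @ H)

print("SVD factors contain negatives:", svd_has_negatives)
print("SVD reconstruction error:", svd_error)
print("NMF reconstruction error:", nmf_error)
```

#### SVD reproduces the matrix up to rounding error, while the rank-2 NMF only approximates it; in exchange, every entry of `W` and `H` can be read as a non-negative "amount" of a topic.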

#### Next I'll assign the most relevant topic to each document in the corpus (corresponding to each title in the main table).

In [None]:
max_index_in_row = np.argmax(corpus_by_topics, 1)
max_value_in_row = np.amax(corpus_by_topics, 1)

print(max_index_in_row)
print(max_value_in_row)
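
#### A toy version of the same assignment, on a made-up 4 x 3 document-by-topic matrix, shows what the two arrays contain. It also shows an edge case worth keeping in mind: a row of zeros (for instance a title whose terms were all filtered out) is assigned topic 0 with a relevance of 0.

```python
import numpy as np

toy = np.array([
    [0.05, 0.30, 0.00],
    [0.20, 0.10, 0.15],
    [0.00, 0.00, 0.40],
    [0.00, 0.00, 0.00],  # a document with no weight on any topic
])

best_topic = np.argmax(toy, axis=1)   # column index of the largest weight per row
best_weight = np.amax(toy, axis=1)    # the largest weight itself

print(best_topic)
print(best_weight)
```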

#### We can look at summary statistics to see how strongly the assigned topic describes each document:

In [None]:
from scipy import stats
stats.describe(max_value_in_row)

#### An average of almost 0.2 (with a rather small variance) doesn't seem terrible for 20 topics. Next I'll assemble a new table to work on:

In [None]:
topic = pd.Series(max_index_in_row)
relevance = pd.Series(max_value_in_row)
frame = {'topic':topic, 'relevance':relevance}
df = pd.DataFrame(frame)

espn_derived = pd.concat([df, espn[['likes', 'dislikes', 'comments']]], axis=1, sort=False)
espn_derived.head()

#### Cleaning up the invalid / poor values which happen to exist in the dataframe:

In [None]:
# inspecting the column, then replacing negative comment counts with zero
print(stats.describe(espn_derived['comments']))
espn_derived['comments'] = np.where(espn_derived['comments'] < 0, 0, espn_derived['comments'])

#removing outliers
z_scores = stats.zscore(espn_derived)
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=1)
espn_derived = espn_derived[filtered_entries].copy() # copy, so that columns can be added later without a SettingWithCopyWarning
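
#### The outlier rule can be checked on a small made-up table. The sketch below recomputes the z-scores with NumPy, using the population standard deviation (which, as far as I know, is also the `scipy.stats.zscore` default), and keeps only the rows where every column lies within three standard deviations of its mean.

```python
import numpy as np

# 20 made-up videos: one has far more likes than the rest
likes = np.array([10.0] * 19 + [1000.0])
dislikes = np.array([1.0, 2.0] * 10)
table = np.column_stack([likes, dislikes])

z = (table - table.mean(axis=0)) / table.std(axis=0)  # column-wise z-scores
keep = (np.abs(z) < 3).all(axis=1)                     # a row must pass in every column
filtered = table[keep]

print(np.round(z[-1], 2))  # z-scores of the unusual video
print(filtered.shape)
```

#### Because the rule is applied to every column, a row is dropped as soon as one of its values is extreme, even if the others are ordinary.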

### <span style="color:red">Topic analysis</span>
#### Visualizing five dimensions of the dataframe: likes (x-axis), dislikes (y-axis), topic (color), number of comments (bubble size), the relevance of the topic (transparency of the bubble):

In [None]:
import seaborn as sns
sns.set(style="white")
import matplotlib.pyplot as plt

#used https://mokole.com/palette.html to generate 20 visually distinct colors
colors=[ "#696969", "#ffe4c4", "#2e8b57", "#8b0000", "#808000", "#000080", "#ff0000", "#ff8c00", "#ffd700", "#ba55d3", "#00ff7f", "#00bfff", "#0000ff", "#adff2f", "#ff00ff", "#f0e68c", "#fa8072", "#dda0dd", "#ff1493", "#7fffd4"]

plt.figure(figsize=(15,8))

# Seaborn does not map alpha to a column, so I define three levels of alpha and draw them separately.
espn_derived["alpha"] = np.where(espn_derived['relevance'] < 0.22, 0.1, np.where(espn_derived['relevance'] < 0.4, 0.4, 0.5))

ax = sns.scatterplot(x="likes", y="dislikes", hue="topic", size="comments", sizes=(1,500), size_norm=(100,4000), alpha=0.1, palette=sns.color_palette(colors), data=espn_derived[espn_derived['alpha']==0.1])

sns.scatterplot(legend=False, ax=ax, x="likes", y="dislikes", hue="topic", size="comments",sizes=(1,500), size_norm=(100,4000), alpha=0.4, palette=sns.color_palette(colors), data=espn_derived[espn_derived['alpha']==0.4])

sns.scatterplot(legend=False, ax=ax, x="likes", y="dislikes", hue="topic", size="comments", sizes=(1,500), size_norm=(100,4000),alpha=0.5, palette=sns.color_palette(colors), data=espn_derived[espn_derived['alpha']==0.5])

plt.plot([0,3000], [0,3000], color='r')

plt.plot([0,14000], [0,1800], color='b')

plt.legend(bbox_to_anchor=(1, 1), loc=2)

ax.legend(ncol=2)


#### The red line marks where the number of likes equals the number of dislikes, so in a way one can call the points on this line controversial. But the situation seems to be more nuanced. For example, points close to the vertical axis (many dislikes, few likes) would usually be called controversial as well. And since videos normally seem to get more likes than dislikes, something like the blue delineator might have some merit too. At any rate, right off the bat the plot reveals curious material for inference:
* #### Viewers do not seem to comment on videos they haven't liked or disliked (the number of comments is considerably lower around the origin).
* #### Some of the most liked and most disliked videos (bubbles near the far ends of the axes) are not discussed as much.
* #### The less nuanced topics (videos with a lower relevance score) have fewer likes, dislikes, and comments.
* #### Some of the most overwhelmingly liked or disliked videos are on nuanced topics.
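
#### The two delineators can also be expressed as a simple rule on the dislikes-to-likes ratio. The sketch below is only one possible reading: the slopes are taken from the two `plt.plot` calls above, while the `region` helper and its labels are hypothetical names introduced for illustration.

```python
RED_SLOPE = 1.0            # red line: dislikes == likes
BLUE_SLOPE = 1800 / 14000  # blue line: roughly 0.13 dislikes per like

def region(likes, dislikes):
    """Place a video relative to the red and blue lines of the plot."""
    if likes == 0 and dislikes == 0:
        return "no reactions"
    if dislikes >= RED_SLOPE * likes:
        return "above red line"
    if dislikes >= BLUE_SLOPE * likes:
        return "between lines"
    return "below blue line"

for likes, dislikes in [(100, 150), (1000, 300), (14000, 1000), (0, 0)]:
    print(likes, dislikes, region(likes, dislikes))
```

#### Applied to the dataframe, such a rule could be used to count how many videos of each topic fall into each region, rather than judging it from the plot by eye.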

#### Now, regarding the topics, topic #4 seems to be among the most contentious. A quick way to get a sense of the topic is a word cloud visualization:

In [None]:

corpus_by_topics_model = model # already fitted above; components_ holds the weight of each term in every topic
weight_dict = dict(zip(vectorizer.get_feature_names_out(), corpus_by_topics_model.components_[4])) # associating the actual terms with their weight (get_feature_names() was removed in scikit-learn 1.2)

from wordcloud import WordCloud

wc = WordCloud(width=1600, height=800)
wc.generate_from_frequencies(weight_dict)
plt.figure(figsize=(15,8))
plt.imshow(wc) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
plt.show() 
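
#### Besides the word cloud, a plain ranked list of the heaviest terms is a quick way to read a topic. The helper below is a hypothetical sketch (`top_terms` is not part of any library), shown on made-up terms and weights.

```python
def top_terms(terms, weights, n=5):
    """Return up to n terms with the largest positive weights, heaviest first."""
    ranked = sorted(zip(terms, weights), key=lambda pair: pair[1], reverse=True)
    return [term for term, weight in ranked[:n] if weight > 0]

toy_terms = ["trade", "game", "debate", "highlights"]
toy_weights = [0.0, 0.7, 1.9, 0.2]

print(top_terms(toy_terms, toy_weights, n=3))
```

#### In the notebook, the same helper could be called with the vectorizer's feature names and one row of `components_`, for example the row of topic #4.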

#### Rather than just particular sports or technical aspects of them, topic #4 centers around some strong words which foment controversy. We may also take a look at topic #17 (associated with some of the most controversial videos):

In [None]:
weight_dict = dict(zip(vectorizer.get_feature_names_out(), corpus_by_topics_model.components_[17]))
 
from wordcloud import WordCloud

wc = WordCloud(width=1600, height=800)
wc.generate_from_frequencies(weight_dict)
plt.figure(figsize=(15,8))
plt.imshow(wc) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
plt.show() 

#### Topic #17 mostly centers on basketball personalities, with LeBron James at the top (also including Michael Jordan, Dwyane Wade, Tyronn Lue, etc.). On the other side of the spectrum, topic #3 seems to be quite favored by most viewers:

In [None]:
weight_dict = dict(zip(vectorizer.get_feature_names_out(), corpus_by_topics_model.components_[3]))

from wordcloud import WordCloud

wc = WordCloud(width=1600, height=800)
wc.generate_from_frequencies(weight_dict)
plt.figure(figsize=(15,8))
plt.imshow(wc) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
plt.show()

#### Topic #3 is about Stephen A. Smith (an ESPN host) with a significant NFL undertone. Topic #0 is also among the favorites, albeit less so than topic #3:

In [None]:
weight_dict = dict(zip(vectorizer.get_feature_names_out(), corpus_by_topics_model.components_[0]))

from wordcloud import WordCloud

wc = WordCloud(width=1600, height=800)
wc.generate_from_frequencies(weight_dict)
plt.figure(figsize=(15,8))
plt.imshow(wc) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
plt.show() 

#### Basically the same ESPN host, but here with a heavy NBA undertone.