# **A Closer Look at the Titles of Popular Videos**

[spaCy](https://spacy.io/usage/spacy-101) is a free, open-source library for advanced Natural Language Processing (NLP) in Python. 

> "Tokens" are usually individual words (at least in languages like English), and "tokenization" is taking a text or set of texts and breaking it up into its individual words. These tokens are then used as the input for other types of analysis or tasks, like parsing (automatically tagging the syntactic relationship between words). We need to tokenize the titles so that we can use them for our title-generating model and other analyses ([source](https://www.kaggle.com/rtatman/tokenization-tutorial))

In [23]:
# Simple example: split a sentence into tokens with spaCy's small English model

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("An apple is not a banana")
for token in doc:
    print(token.text)

An
apple
is
not
a
banana



<table>
<tr><td>0 </td><td> 1 </td> <td>2 </td><td>3 </td><td>4 </td> <td>5 </td></tr>
<tr><td>An </td> <td>  apple</td><td>  is</td><td>not </td> <td>  a</td><td> banana</td></tr>
</table>

In [24]:
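import string

from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
from tqdm import tqdm

punctuations = string.punctuation
stopwords = list(STOP_WORDS)
parser = English()

def spacy_tokenizer(sentence):
    # Tokenize, then lemmatize and lowercase each token;
    # "-PRON-" is the placeholder lemma spaCy 2.x used for pronouns
    mytokens = parser(sentence)
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # Drop stop words and punctuation (a substring test against string.punctuation)
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    # Return the cleaned title as one space-separated string
    return " ".join(mytokens)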


tqdm.pandas()

normal = three_countries["title"][three_countries["popular"] == 0].progress_apply(spacy_tokenizer)
popular = three_countries["title"][three_countries["popular"] == 1].progress_apply(spacy_tokenizer)

100%|██████████| 27452/27452 [00:28<00:00, 963.18it/s]
100%|██████████| 2866/2866 [00:02<00:00, 973.33it/s] 
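One detail of `spacy_tokenizer` is worth a quick check: `punctuations` is a single string, so `word not in punctuations` is a substring test rather than a per-token punctuation test. The sketch below uses only the standard library and contrasts it with a character-wise check. The helper `is_punctuation` and the sample tokens are illustrative assumptions, not part of the original notebook.

```python
import string

# Set of single ASCII punctuation characters
PUNCT_CHARS = set(string.punctuation)

def is_punctuation(token):
    """Hypothetical helper: True if the token is non-empty and made only of punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

# Illustrative tokens, not taken from the dataset
tokens = ["best", "!", "!!", "...", "trailer"]

# Substring test, as in spacy_tokenizer: only drops tokens that occur
# verbatim inside string.punctuation
kept_substring = [t for t in tokens if t not in string.punctuation]

# Character-wise test: drops any token consisting solely of punctuation
kept_charwise = [t for t in tokens if not is_punctuation(t)]

print(kept_substring)
print(kept_charwise)
```

If repeated punctuation such as `...` shows up in the frequency plots, a character-wise filter like this is one possible remedy.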


In [25]:
# Split each cleaned title into words, bigrams and trigrams

import nltk

def word_generator(text):
    return text.split()

def bigram_generator(text):
    # nltk.bigrams yields (w1, w2) tuples; join each pair into one string
    return [' '.join(pair) for pair in nltk.bigrams(text.split())]

def trigram_generator(text):
    # nltk.trigrams yields (w1, w2, w3) tuples; join each triple into one string
    return [' '.join(triple) for triple in nltk.trigrams(text.split())]


normal_words = normal.progress_apply(word_generator)
popular_words = popular.progress_apply(word_generator)
normal_bigrams = normal.progress_apply(bigram_generator)
popular_bigrams = popular.progress_apply(bigram_generator)
normal_trigrams = normal.progress_apply(trigram_generator)
popular_trigrams = popular.progress_apply(trigram_generator)

100%|██████████| 27452/27452 [00:00<00:00, 297169.87it/s]
100%|██████████| 2866/2866 [00:00<00:00, 283571.40it/s]
100%|██████████| 27452/27452 [00:00<00:00, 52267.04it/s]
100%|██████████| 2866/2866 [00:00<00:00, 73730.06it/s]
100%|██████████| 27452/27452 [00:00<00:00, 96072.19it/s]
100%|██████████| 2866/2866 [00:00<00:00, 85867.69it/s]
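To make the output of the generators concrete, here is a minimal standard-library sketch of the same sliding-window idea. The function `ngrams` and the sample title are assumptions for illustration; the notebook itself relies on `nltk.bigrams` and `nltk.trigrams`.

```python
def ngrams(text, n):
    """Return all runs of n consecutive words in text, each joined by a space."""
    words = text.split()
    # A title with fewer than n words yields an empty list
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

title = "official trailer new movie"  # illustrative cleaned title
bigrams = ngrams(title, 2)
trigrams = ngrams(title, 3)

print(bigrams)   # ['official trailer', 'trailer new', 'new movie']
print(trigrams)  # ['official trailer new', 'trailer new movie']
```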


In [26]:
# Plot the 20 most frequent tokens as a horizontal bar chart

import matplotlib.pyplot as plt
import pandas as pd

def word_plot(words, my_color):
    # Flatten the per-title token lists into one list
    slist = []
    for x in words:
        slist.extend(x)
    plt.figure(figsize=(15, 10))
    pd.Series(slist).value_counts()[:20].sort_values(ascending=True).plot(kind='barh', fontsize=20, color=my_color)
    plt.show()
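The counting step inside `word_plot` can also be inspected without drawing anything. The following is a small sketch using `collections.Counter`; the name `top_tokens` and the sample data are hypothetical and only meant to show which numbers end up on the bars.

```python
from collections import Counter
from itertools import chain

def top_tokens(token_lists, k=20):
    """Flatten a collection of token lists and return the k most common tokens with counts."""
    counts = Counter(chain.from_iterable(token_lists))
    return counts.most_common(k)

# Illustrative token lists, one per title
sample = [
    ["official", "trailer"],
    ["official", "video"],
    ["official", "trailer", "hd"],
]
top = top_tokens(sample, k=2)
print(top)  # [('official', 3), ('trailer', 2)]
```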


In [27]:
word_plot(popular_words,'blue')


In [28]:
word_plot(popular_bigrams,'orange')


In [29]:
word_plot(popular_trigrams,'red')

### TfidfVectorizer

Tf-idf weighs how informative each token (word) is across the whole collection of documents. The more often a word appears in a single document (here, a title), the more weight it gets. However, the more documents (titles) the word appears in, the more it is 'penalized': its weight is reduced because such a word is empirically less informative than features that occur in only a small fraction of the training corpus ([source](https://www.kaggle.com/adamschroeder/countvectorizer-tfidfvectorizer-predict-comments))

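* tf(t, d) = the term frequency: the number of times term 't' appears in document 'd'
* df(t) = the document frequency: the number of documents that contain term 't'
* idf(t) = the inverse document frequency, which shrinks as df(t) grows; with `smooth_idf=False`, scikit-learn computes it as ln(n / df(t)) + 1, where n is the number of documents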

#### For example, the word "banana" appears in all documents, so its idf is the lowest

In [30]:
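from sklearn.feature_extraction.text import TfidfVectorizer

txt1 = ['I like banana', 'An apple is not a banana', 'banana banana oh banana']
# No idf smoothing, no sublinear tf and no normalization, so the raw weights stay easy to read
tf = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word')
txt_fitted = tf.fit(txt1)
txt_transformed = txt_fitted.transform(txt1)
print("The text: ", txt1)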

The text:  ['I like banana', 'An apple is not a banana', 'banana banana oh banana']


In [31]:
tf.vocabulary_

{'like': 4, 'banana': 2, 'an': 0, 'apple': 1, 'is': 3, 'not': 5, 'oh': 6}

In [32]:
idf = tf.idf_
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(dict(zip(txt_fitted.get_feature_names_out(), idf)))
print("\nThe token 'banana' appears 5 times but it is also in all documents, so its idf is the lowest")

{'an': 2.09861228866811, 'apple': 2.09861228866811, 'banana': 1.0, 'is': 2.09861228866811, 'like': 2.09861228866811, 'not': 2.09861228866811, 'oh': 2.09861228866811}

The token 'banana' appears 5 times but it is also in all documents, so its idf is the lowest
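The idf values printed above can be reproduced by hand. The sketch below assumes scikit-learn's unsmoothed formula, idf(t) = ln(n / df(t)) + 1, and imitates the default tokenizer with a regular expression that keeps lowercase tokens of two or more word characters (which is why "I" and "a" are missing from the vocabulary). The helper names are illustrative.

```python
import math
import re
from collections import Counter

docs = ['I like banana', 'An apple is not a banana', 'banana banana oh banana']

def tokenize(doc):
    # Approximation of TfidfVectorizer's default: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(docs)
vocab = sorted({t for tokens in tokenized for t in tokens})

# df(t): number of documents containing t
df = {t: sum(t in tokens for tokens in tokenized) for t in vocab}
# idf(t) without smoothing: ln(n / df(t)) + 1
idf = {t: math.log(n_docs / df[t]) + 1 for t in vocab}

# tf-idf weights for the third document (no normalization, raw counts as tf)
counts = Counter(tokenized[2])
tfidf_doc3 = {t: counts[t] * idf[t] for t in counts}

print(idf)
print(tfidf_doc3)
```

"banana" has df = 3, so its idf is ln(3/3) + 1 = 1.0, while every other token has df = 1 and idf = ln(3) + 1 ≈ 2.0986.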


In [33]:
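from sklearn.linear_model import LogisticRegression

# Tf-idf features on single words (unigrams), limited to the 10,000 most frequent terms
word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 1),
    max_features=10000)
word_vectorizer.fit(three_countries.title)
word_features = word_vectorizer.transform(three_countries.title)

# Logistic regression that predicts the 'popular' label from the title features
classifier_popular = LogisticRegression(C=0.1, solver='sag')
classifier_popular.fit(word_features, three_countries.popular)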


LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='sag',
          tol=0.0001, verbose=0, warm_start=False)
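The vectorizer above sets `sublinear_tf=True`, which in scikit-learn replaces a raw term count tf with 1 + ln(tf), so a word repeated several times in one title gains weight more slowly than its count. A minimal sketch of that scaling follows; the function name is an assumption for illustration.

```python
import math

def sublinear_tf(count):
    """Scaled term frequency: 1 + ln(count) for positive counts, 0 otherwise."""
    return 1 + math.log(count) if count > 0 else 0.0

# Raw counts versus their dampened values
scaled = {count: sublinear_tf(count) for count in (0, 1, 3, 10)}
for count, value in scaled.items():
    print(count, round(value, 4))
```

A token that occurs once keeps a weight of 1, while a token that occurs three times is weighted about 2.1 rather than 3.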

In [34]:
names=['normal','popular']

In [35]:
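from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline

# Chain vectorizer and classifier so that LIME can call predict_proba on raw titles
c_tf = make_pipeline(word_vectorizer, classifier_popular)
explainer_tf = LimeTextExplainer(class_names=names)

exp = explainer_tf.explain_instance(three_countries.title.iloc[10], c_tf.predict_proba, num_features=4, top_labels=1)
exp.show_in_notebook(text=three_countries.title.iloc[10])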


In [36]:
exp = explainer_tf.explain_instance(three_countries.title.iloc[4], c_tf.predict_proba, num_features=5, top_labels=1)
exp.show_in_notebook(text=three_countries.title.iloc[4])

In [37]:
exp = explainer_tf.explain_instance(three_countries.title.iloc[10035], c_tf.predict_proba, num_features=5, top_labels=1)
exp.show_in_notebook(text=three_countries.title.iloc[10035])

In [38]:
# plotly.plotly moved to the separate chart_studio package in plotly 4; offline plotting does not need it
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.tools as tls


labels = list(three_countries.category_names.value_counts().index.values)
values = list(three_countries.category_names.value_counts().values)

trace = go.Pie(labels=labels, values=values)

iplot([trace], filename='basic_pie_chart')

In [39]:
three_countries.groupby('category_names')['views'].describe()

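<table>
<tr><th>category_names</th><th>count</th><th>mean</th><th>std</th><th>min</th><th>25%</th><th>50%</th><th>75%</th><th>max</th></tr>
<tr><td>Autos &amp; Vehicles</td><td>310.0</td><td>318807.0</td><td>302074.6</td><td>1858.0</td><td>84825.5</td><td>272747.0</td><td>480935.0</td><td>1992282.0</td></tr>
<tr><td>Comedy</td><td>2305.0</td><td>536622.8</td><td>710370.1</td><td>1295.0</td><td>146885.0</td><td>335219.0</td><td>648648.0</td><td>12924004.0</td></tr>
<tr><td>Education</td><td>763.0</td><td>285221.8</td><td>553764.2</td><td>773.0</td><td>92603.0</td><td>179824.0</td><td>326820.5</td><td>12100921.0</td></tr>
<tr><td>Entertainment</td><td>9730.0</td><td>459830.4</td><td>1043833.0</td><td>733.0</td><td>95177.0</td><td>211936.5</td><td>451687.25</td><td>37736281.0</td></tr>
<tr><td>Film &amp; Animation</td><td>1431.0</td><td>493865.5</td><td>1021451.0</td><td>943.0</td><td>74924.5</td><td>168132.0</td><td>473297.5</td><td>15969920.0</td></tr>
<tr><td>Gaming</td><td>966.0</td><td>499258.8</td><td>908726.5</td><td>1237.0</td><td>128074.0</td><td>233124.0</td><td>531995.25</td><td>15919643.0</td></tr>
<tr><td>Howto &amp; Style</td><td>1780.0</td><td>380608.5</td><td>646728.8</td><td>1107.0</td><td>65779.0</td><td>158959.0</td><td>407440.5</td><td>8307705.0</td></tr>
<tr><td>Movies</td><td>1.0</td><td>225528.0</td><td>NaN</td><td>225528.0</td><td>225528.0</td><td>225528.0</td><td>225528.0</td><td>225528.0</td></tr>
<tr><td>Music</td><td>2479.0</td><td>1230716.0</td><td>3113640.0</td><td>1591.0</td><td>137488.5</td><td>364212.0</td><td>976700.0</td><td>47669287.0</td></tr>
<tr><td>News &amp; Politics</td><td>3414.0</td><td>209538.2</td><td>373284.4</td><td>549.0</td><td>47722.25</td><td>109291.0</td><td>213227.75</td><td>5943011.0</td></tr>
</table>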


## Latent Dirichlet Allocation (LDA) by Category

In [40]:
entertainment_title= three_countries["title"][(three_countries['category_names'] == 'Entertainment')] 
news_politics_title= three_countries["title"][(three_countries['category_names'] == 'News & Politics')] 
people_title= three_countries["title"][(three_countries['category_names'] == 'People & Blogs')] 
music_title= three_countries["title"][(three_countries['category_names'] == 'Music')] 
sports_title= three_countries["title"][(three_countries['category_names'] == 'Sports')] 
comedy_title= three_countries["title"][(three_countries['category_names'] == 'Comedy')] 

In [41]:
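from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Raw string so that '\-' reaches the regex engine without an invalid-escape warning
vectorizer_entertainment_title = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')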
entertainment_title_vectorized = vectorizer_entertainment_title.fit_transform(entertainment_title)
lda_popular_entertainment_title_vectorized = LatentDirichletAllocation(n_components=7, max_iter=5, learning_method='online',verbose=True)
entertainment_title_vectorized_lda = lda_popular_entertainment_title_vectorized.fit_transform(entertainment_title_vectorized )

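import pyLDAvis
import pyLDAvis.sklearn  # this module was renamed to pyLDAvis.lda_model in pyLDAvis 3.4

pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda_popular_entertainment_title_vectorized,entertainment_title_vectorized, vectorizer_entertainment_title, mds='tsne')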
dash

iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
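The `CountVectorizer` settings decide which words LDA ever sees, so it helps to look at them in isolation. The sketch below uses only the standard library to imitate two of those settings: the `token_pattern` (runs of three or more letters or hyphens, applied to the lowercased title) and the `min_df=5` / `max_df=0.9` document-frequency limits. Stop-word removal is left out, and the function names and toy data are assumptions for illustration.

```python
import re
from collections import Counter

# Same regular expression as the token_pattern in the LDA cells, written as a raw string
TOKEN_PATTERN = re.compile(r"[a-zA-Z\-][a-zA-Z\-]{2,}")

def tokenize(title):
    """Lowercase a title and keep runs of 3+ letters or hyphens."""
    return TOKEN_PATTERN.findall(title.lower())

def filter_by_document_frequency(tokenized_docs, min_df=5, max_df=0.9):
    """Keep terms found in at least min_df documents and at most max_df (a share) of them."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    max_count = max_df * n_docs
    return {term for term, count in doc_freq.items() if min_df <= count <= max_count}

# Digits, short tokens such as "hd" and a lone hyphen are dropped
example_tokens = tokenize("Spider-Man 2 - Official Trailer (HD)")
print(example_tokens)  # ['spider-man', 'official', 'trailer']

# Ten toy documents: "video" is in all of them, "trailer" in six, "rare" in one
toy_docs = [["video", "trailer"]] * 6 + [["video"]] * 3 + [["video", "rare"]]
kept_terms = filter_by_document_frequency(toy_docs)
print(kept_terms)  # {'trailer'}
```

Terms that are too rare (fewer than 5 titles) or too common (more than 90% of titles) never reach the topic model, which is worth keeping in mind when reading the pyLDAvis dashboards for smaller categories.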


In [42]:
vectorizer_news_politics_title = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
news_politics_title_vectorized = vectorizer_news_politics_title.fit_transform(news_politics_title)
lda_news_politics_title_vectorized= LatentDirichletAllocation(n_components=7, max_iter=5, learning_method='online',verbose=True)
news_politics_title_vectorized_lda = lda_news_politics_title_vectorized.fit_transform(news_politics_title_vectorized )

pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda_news_politics_title_vectorized,news_politics_title_vectorized, vectorizer_news_politics_title , mds='tsne')
dash

iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5


In [43]:
vectorizer_people_title = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
people_title_vectorized = vectorizer_people_title.fit_transform(people_title)
lda_people_title_vectorized= LatentDirichletAllocation(n_components=7, max_iter=5, learning_method='online',verbose=True)
people_title_vectorized_lda = lda_people_title_vectorized.fit_transform(people_title_vectorized )

pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda_people_title_vectorized,people_title_vectorized, vectorizer_people_title , mds='tsne')
dash

iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5


In [44]:
vectorizer_music_title = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
music_title_vectorized = vectorizer_music_title.fit_transform(music_title)
lda_music_title_vectorized= LatentDirichletAllocation(n_components=7, max_iter=5, learning_method='online',verbose=True)
music_title_vectorized_lda = lda_music_title_vectorized.fit_transform(music_title_vectorized )

pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda_music_title_vectorized,music_title_vectorized, vectorizer_music_title , mds='tsne')
dash

iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5


In [45]:
vectorizer_sports_title = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
sports_title_vectorized = vectorizer_sports_title.fit_transform(sports_title)
lda_sports_title_vectorized= LatentDirichletAllocation(n_components=7, max_iter=5, learning_method='online',verbose=True)
sports_title_vectorized_lda = lda_sports_title_vectorized.fit_transform(sports_title_vectorized )

pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda_sports_title_vectorized,sports_title_vectorized, vectorizer_sports_title , mds='tsne')
dash

iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5


In [46]:
vectorizer_comedy_title = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
comedy_title_vectorized = vectorizer_comedy_title.fit_transform(comedy_title)
lda_comedy_title_vectorized= LatentDirichletAllocation(n_components=7, max_iter=5, learning_method='online',verbose=True)
comedy_title_vectorized_lda = lda_comedy_title_vectorized.fit_transform(comedy_title_vectorized )

pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda_comedy_title_vectorized,comedy_title_vectorized, vectorizer_comedy_title , mds='tsne')
dash

iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
