<a id="1"></a>

<h1 style="background-color:orange;font-family:newtimeroman;font-size:250%;text-align:center;border-radius: 15px 50px;">Libraries And Utilities</h1>


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
import string
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
import plotly.express as ex
import plotly.graph_objs as go
import plotly.offline as pyo
from plotly.subplots import make_subplots
pyo.init_notebook_mode()
from sklearn.decomposition import TruncatedSVD,PCA
from sklearn.manifold import Isomap 
nltk.download('vader_lexicon')
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial import distance_matrix
from sklearn.metrics import pairwise_distances
from sklearn.cluster import KMeans,DBSCAN
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from wordcloud import WordCloud,STOPWORDS
from pandas.plotting import autocorrelation_plot
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from nltk.util import ngrams
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import random
import warnings
warnings.filterwarnings("ignore")
plt.rc('figure',figsize=(17,13))

In [None]:
f_data = pd.read_csv('/kaggle/input/notebooks-of-the-week-hidden-gems/kaggle_hidden_gems.csv')
f_data = f_data.rename(columns={'review':'text'})
f_data.head(3)

<a id="1"></a>

<h1 style="background-color:orange;font-family:newtimeroman;font-size:250%;text-align:center;border-radius: 15px 50px;">Preprocessing and Feature Engineering</h1>


In [None]:
# Remove Twitter handles
f_data.text = f_data.text.apply(lambda x:re.sub(r'@\S+','',x))

#remove hashtags
f_data.text = f_data.text.apply(lambda x:re.sub(r'\B#\S+','',x))


# Remove URLS
f_data.text = f_data.text.apply(lambda x:re.sub(r"http\S+", "", x))

# Remove all the special characters
f_data.text = f_data.text.apply(lambda x:' '.join(re.findall(r'\w+', x)))

# Remove single characters (replace with a space so neighbouring words are not merged)
f_data.text = f_data.text.apply(lambda x:re.sub(r'\s+[a-zA-Z]\s+', ' ', x))

# Substituting multiple spaces with single space
f_data.text = f_data.text.apply(lambda x:re.sub(r'\s+', ' ', x, flags=re.I))
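
The cleaning steps above can be checked on a single string before applying them to the whole column. The following is a minimal sketch, not the notebook's own code: `clean_text` and `sample` are hypothetical names, and the single-letter step substitutes a space so that neighbouring words are not glued together.

```python
import re

def clean_text(text):
    """Apply the cleaning steps, in order, to one string."""
    text = re.sub(r'@\S+', '', text)             # drop @handles
    text = re.sub(r'\B#\S+', '', text)           # drop #hashtags
    text = re.sub(r'http\S+', '', text)          # drop URLs
    text = ' '.join(re.findall(r'\w+', text))    # keep only word characters
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # drop isolated single letters
    return re.sub(r'\s+', ' ', text).strip()     # collapse whitespace

# Hypothetical example input
sample = "Great EDA by @someone! See https://example.com #viz for a quick look."
print(clean_text(sample))  # -> Great EDA by See for quick look
```

Wrapping the steps in one function also makes it possible to run them with a single `f_data.text.apply(clean_text)` call.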

In [None]:
#Extracting Text Sentiments Using the "VADER" lexicon
sid = SIA()
f_data['sentiments']           = f_data['text'].apply(lambda x: sid.polarity_scores(' '.join(re.findall(r'\w+',x.lower()))))
f_data['Positive Sentiment']   = f_data['sentiments'].apply(lambda x: x['pos']+1*(10**-6)) 
f_data['Neutral Sentiment']    = f_data['sentiments'].apply(lambda x: x['neu']+1*(10**-6))
f_data['Negative Sentiment']   = f_data['sentiments'].apply(lambda x: x['neg']+1*(10**-6))

f_data.drop(columns=['sentiments'],inplace=True)
hg_data = f_data[['date','text','Positive Sentiment','Neutral Sentiment','Negative Sentiment']].copy()

#Number of Words
hg_data['Number_Of_Words'] = hg_data.text.apply(lambda x:len(x.split(' ')))
#Average Word Length
hg_data['Mean_Word_Length'] = hg_data.text.apply(lambda x:np.round(np.mean([len(w) for w in x.split(' ')]),2) )
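
One edge case in the two features above: `"".split(' ')` returns `['']`, so a description that ends up empty after cleaning would be counted as one word of length zero. A minimal sketch that handles this explicitly, assuming a hypothetical helper named `text_stats`:

```python
def text_stats(text):
    """Return (number of words, mean word length rounded to 2 decimals)."""
    words = text.split()  # splits on any whitespace and drops empty tokens
    if not words:
        return 0, 0.0
    mean_length = sum(len(w) for w in words) / len(words)
    return len(words), round(mean_length, 2)

print(text_stats("hidden gems notebook"))  # -> (3, 6.0)
print(text_stats(""))                      # -> (0, 0.0)
```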



<a id="1"></a>

<h1 style="background-color:orange;font-family:newtimeroman;font-size:250%;text-align:center;border-radius: 15px 50px;">Exploratory Data Analysis</h1>


In [None]:
plt.subplot(2,1,1)
plt.title('Distribution Of Sentiments Across Descriptions',fontsize=19,fontweight='bold')
sns.kdeplot(f_data['Negative Sentiment'],label='Negative Sentiment')
sns.kdeplot(f_data['Neutral Sentiment'] ,label='Neutral Sentiment' ,color='orange' )
sns.kdeplot(f_data['Positive Sentiment'],label='Positive Sentiment',color='tab:red')
plt.legend()
plt.subplot(2,1,2)
plt.title('CDF Of Sentiments Across Descriptions',fontsize=19,fontweight='bold')
sns.kdeplot(f_data['Negative Sentiment'],cumulative=True,label='Negative Sentiment')
sns.kdeplot(f_data['Neutral Sentiment'],cumulative=True,label='Neutral Sentiment' ,color='orange' )
sns.kdeplot(f_data['Positive Sentiment'],cumulative=True ,label='Positive Sentiment',color='tab:red')
plt.xlabel('Sentiment Value',fontsize=19)
plt.legend()
plt.show()

**Observation**: Looking at the distribution of sentiments in Martin's description of each notebook in his hidden gems posts, we see that the negative sentiment is centered around zero with a very low standard deviation (visible as the very tall, tight bell around zero), while the positive and neutral sentiments have two very wide distributions.
This suggests that Martin usually avoids negative comments about the "Gems" he chooses to post. At the same time, the spread in the neutral and positive sentiments may indicate that his descriptions follow a certain pattern that is skewed to some extent.

In [None]:

lyr = hg_data.text.copy()
#lyr = lyr.apply(lambda x:x.replace(',',''))


CV = CountVectorizer(stop_words='english',ngram_range=(1,1))
cv = CV.fit_transform(lyr)
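# vocabulary_ maps term -> column index, so sort the terms by index to label the columns correctly
cv_df = pd.DataFrame(cv.toarray(),columns=sorted(CV.vocabulary_,key=CV.vocabulary_.get),index=hg_data.date)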
DM = pairwise_distances(cv_df,cv_df,metric='manhattan')
sns.clustermap(DM,cmap='vlag')

**The question** I asked was how similar Martin's descriptions of the notebooks he posts are. After following Martin's hidden gems for a while, it is clear that he tries to point out the hidden value in each notebook, but how different are his descriptions?

To **answer** the question, all the descriptions written by Martin were vectorized into word counts, and the L1 (Manhattan) distance was calculated between each pair of reviews.

Looking closely at the clustered heatmap, we can see which reviews resemble each other and which are entirely different.
Such resemblance between descriptions may be due to similar ideas in the notebooks Martin posted.
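
To make the distance concrete, here is a minimal sketch of the L1 distance between the word-count vectors of two texts. It assumes plain whitespace tokenization with no stop-word removal, so it is simpler than `CountVectorizer`; the function name `manhattan` and the three toy reviews are hypothetical.

```python
from collections import Counter

def manhattan(doc_a, doc_b):
    """L1 distance between the word-count vectors of two documents."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    vocabulary = set(counts_a) | set(counts_b)
    # A Counter returns 0 for missing words, so absent terms contribute their full count
    return sum(abs(counts_a[w] - counts_b[w]) for w in vocabulary)

# Hypothetical toy reviews
reviews = [
    "clean visual analysis of the data",
    "clean visual analysis with great plots",
    "transformer model for text classification",
]
for a in reviews:
    print([manhattan(a, b) for b in reviews])
```

The first two toy reviews share three words and end up much closer to each other than to the third, which is the kind of block structure the clustered heatmap surfaces.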


In [None]:
NUMBER_OF_COMPONENTS = 100

stemmer= PorterStemmer()
lemmatizer=WordNetLemmatizer()
SVD = TruncatedSVD(NUMBER_OF_COMPONENTS)

text_data = lyr
text_data = text_data.apply(lambda x: ' '.join([stemmer.stem(word) for word in word_tokenize(x)]))
text_data = text_data.apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in word_tokenize(x) if word not in STOPWORDS]))

C_vector = cv

pc_matrix = SVD.fit_transform(C_vector)

evr = SVD.explained_variance_ratio_
total_var = evr.sum() * 100
cumsum_evr = np.cumsum(evr)

trace1 = {
    "name": "individual explained variance", 
    "type": "bar", 
    'y':evr}
trace2 = {
    "name": "cumulative explained variance", 
    "type": "scatter", 
     'y':cumsum_evr}
data = [trace1, trace2]
layout = {
    "xaxis": {"title": "Principal components"}, 
    "yaxis": {"title": "Explained variance ratio"},
  }
fig = go.Figure(data=data, layout=layout)
fig.update_layout(     title='{:.2f}% of the Description Text Variance Can Be Explained Using {} Components'.format(np.sum(evr)*100,NUMBER_OF_COMPONENTS))
fig.show()

**The above plot answers the question: how much variation is there in Martin's descriptions?**
We already observed that, based on the L1 distance, there are some visible similarities. But how many components (each a weighted combination of words) are required to explain at least 80% of the variance in Martin's descriptions? Apparently about 100, which suggests Martin does an excellent job of varying his wording.
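
The "how many components reach 80%" question can be read directly off the cumulative explained variance ratio. A minimal sketch, assuming a hypothetical helper `components_needed` and made-up toy ratios (not values from this dataset):

```python
from itertools import accumulate

def components_needed(explained_variance_ratio, threshold=0.80):
    """Smallest number of leading components whose cumulative ratio
    reaches the threshold, or None if it is never reached."""
    running_totals = accumulate(explained_variance_ratio)
    for count, total in enumerate(running_totals, start=1):
        if total >= threshold:
            return count
    return None

# Hypothetical ratios for illustration only
toy_evr = [0.40, 0.25, 0.10, 0.08, 0.05]
print(components_needed(toy_evr))        # -> 4
print(components_needed(toy_evr, 0.95))  # -> None
```

In the notebook, the same helper could be called on `SVD.explained_variance_ratio_`; a result of `None` would mean more components are needed than were fitted.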

In [None]:
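# Terms ordered by column index (works across scikit-learn versions)
feature_names = sorted(CV.vocabulary_, key=CV.vocabulary_.get)
# Terms ranked by their weight in the first SVD component
best_features = [[feature_names[i],SVD.components_[0][i]] for i in SVD.components_[0].argsort()[::-1]]
worddf = pd.DataFrame(np.array(best_features[:NUMBER_OF_COMPONENTS])[:,0]).rename(columns={0:'Word'})
# Note: evr holds per-component ratios, so row k pairs the k-th ranked term with component k
worddf['Explained Variance'] =  np.round(evr*100,2)
worddf['Explained Variance'] =worddf['Explained Variance'].apply(lambda x:str(x)+'%')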
app = []
for word in worddf.Word:
    total_count = 0  # number of (stemmed) reviews containing the word as a substring
    for review in text_data:
        if review.find(word)!= -1:
            total_count+=1
    app.append(total_count)
worddf['Appeared_On_X_Tweets'] = app
worddf

fig = go.Figure()
fig.add_trace(
    go.Table(
        header=dict(
            values=['<b>Word</b>',"<b>Accountable For X% of Variance</b>",'<b>Appeared On X Reviews</b>'],
            font=dict(size=19,family="Lato"),
            align="center"
        ),
        cells=dict(
            values=[worddf[k].tolist() for k in ['Word',"Explained Variance",'Appeared_On_X_Tweets']],
            align = "center")
    ),
    
)

fig.show()

In [None]:
NUMBER_OF_COMPONENTS = 2
stemmer= PorterStemmer()
lemmatizer=WordNetLemmatizer()
isomap = Isomap(NUMBER_OF_COMPONENTS)

text_data = lyr
text_data = text_data.apply(lambda x: ' '.join([stemmer.stem(word) for word in word_tokenize(x)]))
text_data = text_data.apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in word_tokenize(x) if word not in STOPWORDS]))

C_vector = cv

pc_matrix = isomap.fit_transform(C_vector)

dec_df = hg_data.copy()
dec_df = dec_df.assign(pc_1 = pc_matrix[:,0],pc_2 = pc_matrix[:,1])
ex.scatter(dec_df,x='pc_1',y='pc_2',title=r'Martins Comments Projected From R^1124 --> R^2')


Knowing that some descriptions tend to be similar, and that most of the variance in the descriptions is captured by only about 100 components, the next question that naturally arises is which descriptions can be clustered together and what properties such clusters have.

First, we take a quick look at a low-dimensional representation of the descriptions. We can see some fairly dense clusters containing a small number of descriptions, but overall there is a large spread, which supports our hypothesis that the similarity is caused by something minor, like similar topics, and not by Martin's writing habits.

In [None]:
dbscan = DBSCAN(min_samples=3,eps=0.8)
dbscan.fit(dec_df[['pc_1','pc_2']])
dec_df = dec_df.assign(cluster=dbscan.labels_)
fig = ex.scatter(dec_df,x='pc_1',y='pc_2',title=r'Martins Comments Clustered in R^2',color='cluster')
fig.update_layout(hovermode='y')


We use DBSCAN to extract the clusters formed by descriptions that lie close enough together in the reduced two-dimensional space.
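
scikit-learn's DBSCAN labels points that belong to no cluster as `-1` (noise), so it helps to check how many descriptions each cluster actually contains before interpreting them. A minimal sketch, assuming a hypothetical helper `cluster_summary` and made-up labels:

```python
from collections import Counter

def cluster_summary(labels):
    """Return ({cluster_id: size}, number_of_noise_points)."""
    counts = Counter(int(label) for label in labels)
    noise = counts.pop(-1, 0)  # DBSCAN marks noise points with -1
    return dict(sorted(counts.items())), noise

# Hypothetical labels for illustration only
toy_labels = [0, 0, 0, -1, 1, 1, 1, -1, 0]
sizes, noise = cluster_summary(toy_labels)
print(sizes, noise)  # -> {0: 4, 1: 3} 2
```

In the notebook this could be called as `cluster_summary(dbscan.labels_)`. Any loop that starts at cluster `0` leaves the noise points out.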

In [None]:
wordclouds = []
plt.figure(figsize=(20,10))
for i in range(0,np.max(dec_df.cluster)+1):
    cluster = dec_df.query(f'cluster=={i}')
    WC = WordCloud(background_color='white',width=500,height=500).generate(' '.join(cluster.text))
    wordclouds.append(WC)
    
for i in range(0,np.max(dec_df.cluster)+1):
    plt.subplot(2,np.max(dec_df.cluster)//2+1,i+1)
    plt.imshow(wordclouds[i])
    plt.axis('off')
    plt.title(f'Most Common Words in Cluster : {i}',fontsize=12)
    
plt.tight_layout()

It is interesting to see the patterns emerging in the word clouds of each cluster's descriptions.
For example, cluster 4 is centered around NLP and classification, whereas cluster 0 deals mainly with analysis and visuals.

In [None]:
dec_df

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as grid_spec
from sklearn.neighbors import KernelDensity

countries = [x for x in range(0,11)]
colors = ['#0000ff', '#3300cc', '#660099', '#990066', '#cc0033', '#ff0000','#612B1F','#9272CA','#8DB23F','#BB3C76','#645FB3']
#dec_df[dec_df.cluster >= 0]
gs = grid_spec.GridSpec(len(countries),1)
fig = plt.figure(figsize=(16,9))

i = 0

ax_objs = []
for cluster in range(0,11):
    # Positive sentiment of the descriptions assigned to this cluster
    x = dec_df.loc[dec_df.cluster == cluster, 'Positive Sentiment'].to_numpy()
    x_d = np.linspace(0,1, 1000)

    kde = KernelDensity(bandwidth=0.03, kernel='gaussian')
    kde.fit(x[:, None])

    logprob = kde.score_samples(x_d[:, None])

    # creating new axes object
    ax_objs.append(fig.add_subplot(gs[i:i+1, 0:]))

    # plotting the distribution
    ax_objs[-1].plot(x_d, np.exp(logprob),color="#f0f0f0",lw=1)
    ax_objs[-1].fill_between(x_d, np.exp(logprob), alpha=1,color=colors[i])

    ax_objs[-1].set_xlim(0,1)
    #ax_objs[-1].set_ylim(0,2.5)

    # make background transparent
    rect = ax_objs[-1].patch
    rect.set_alpha(0)

    # remove borders, axis ticks, and labels
    ax_objs[-1].set_yticklabels([])

    if i == len(countries)-1:
        ax_objs[-1].set_xlabel("Positive Sentiment", fontsize=16,fontweight="bold")
    else:
        ax_objs[-1].set_xticklabels([])

    spines = ["top","right","left","bottom"]
    for s in spines:
        ax_objs[-1].spines[s].set_visible(False)

    #adj_country = country.replace(" ","\n")
    ax_objs[-1].text(-0.02,0,f'Cluster: {cluster}',fontweight="bold",fontsize=14,ha="right")


    i += 1

gs.update(hspace=-0.6)

fig.text(0.07,0.85,"Distribution of Positive Sentiment in Each Cluster",fontsize=20)

plt.grid(False)
plt.tight_layout()
plt.show()

Observing the ridge plot of the positive sentiment within each cluster, we see no clear pattern: the positive sentiment in Martin's descriptions does not appear to be connected to any topic. An alternative hypothesis would be that if Martin preferred, for example, NLP over other topics, he would post more NLP-related notebooks and make somewhat more positive statements about them. That contrasts with what we actually see, which is usually a multimodal distribution in each of our clusters.

In [None]:
# numeric_only=True skips the text column, which cannot be averaged
mean_df = hg_data.groupby('date').mean(numeric_only=True)
b_date_mean = mean_df.reset_index()


fig = go.Figure()

for column in ['Mean_Word_Length','Number_Of_Words','Positive Sentiment','Negative Sentiment']:
    fig.add_trace(
        go.Scatter(
            x = b_date_mean.date,
            y = b_date_mean[column],
            name = column,
            mode='lines'
        )
    )
    

btns = []
for x,col in enumerate(['Mean_Word_Length','Number_Of_Words','Positive Sentiment','Negative Sentiment']):
    bol = [False]*4  # one visibility flag per trace
    bol[x]=True
    d = dict(label = col,
                  method = 'update',
                  args = [{'visible':bol},
                          {'title': 'Distribution of [' +col+'] Over Our Timeline',
                           'showlegend':True}])
    btns.append(d)
    
    
fig.update_layout(title='How Different Text Attributes Change Over The History of "Hidden Gems"',
    updatemenus=[go.layout.Updatemenu(
        active=0,
        showactive=True,
        buttons=btns
        )
    ])

fig.update_xaxes(title_text='Date')
fig.update_yaxes(title_text='Post Mean Value')

fig.show()

The above interactive plot lets us observe how different attributes of Martin's descriptions change over the lifetime of the "hidden gems" posts.
Looking closely at the positive sentiment and how it varies over time, there seems to be a slight decline followed by a rise around December 2020.



In [None]:
fig = make_subplots(rows=4, cols=2, subplot_titles=('Observed Pos', 'Observed Neg', 'Trend Pos','Trend Neg','Seasonal Pos','Seasonal Neg','Residual Pos','Residual Neg'))
lbl = ['Positive','Negative']

for idx,column in enumerate(['Positive Sentiment','Negative Sentiment']):
    res = seasonal_decompose(b_date_mean[column], period=4, model='additive', extrapolate_trend='freq')
    
    fig.add_trace(
    go.Scatter(x=b_date_mean.date, y=res.observed,name='{} Observed'.format(lbl[idx])),
    row=1, col=idx+1)
    
    fig.add_trace(
    go.Scatter(x=b_date_mean.date, y=res.trend,name='{} Trend'.format(lbl[idx])),
    row=2, col=idx+1)
    
    fig.add_trace(
    go.Scatter(x=b_date_mean.date, y=res.seasonal,name='{} Seasonal'.format(lbl[idx])),
    row=3, col=idx+1)
    
    fig.add_trace(
    go.Scatter(x=b_date_mean.date, y=res.resid,name='{} Residual'.format(lbl[idx])),
    row=4, col=idx+1)
            
fig.update_layout(height=600, width=900, title_text="Decomposition Of Our Sentiments into Trend, Seasonality and Residuals")
fig.show()

Treating the positive sentiment of Martin's descriptions as our signal and decomposing it into three components (trend, seasonal and residual) with a period of 4 observations (roughly one month of weekly posts), we see that there was indeed a downward trend starting in July 2020 and reaching its minimum in December 2020, after which an upward trend is observed.
I wonder what happened in those months to produce this underlying behavior.
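
For intuition about what the decomposition does, here is a simplified pure-Python sketch of classical additive decomposition for an even period. It is an illustration under stated assumptions, not the statsmodels implementation: the notebook's call also extrapolates the trend at the edges (`extrapolate_trend='freq'`), which this sketch leaves as `None`. The function name `decompose_additive` and the toy series are hypothetical.

```python
def decompose_additive(series, period=4):
    """Split a series into (trend, seasonal, resid) for an even period.

    trend and resid are None at the edges, where the centred
    moving average is undefined.
    """
    n, half = len(series), period // 2
    # Centred moving average: half weight on the two end points
    weights = [0.5] + [1.0] * (period - 1) + [0.5]
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        trend[t] = sum(w * y for w, y in zip(weights, window)) / period

    # Average the detrended values at each position within the period
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]
    centre = sum(means) / period  # force the seasonal part to average zero
    seasonal = [means[t % period] - centre for t in range(n)]

    resid = [None if trend[t] is None else series[t] - trend[t] - seasonal[t]
             for t in range(n)]
    return trend, seasonal, resid

# Hypothetical signal: linear trend plus a zero-mean pattern of length 4
toy_pattern = [0.1, -0.05, -0.1, 0.05]
toy_series = [0.3 + 0.01 * t + toy_pattern[t % 4] for t in range(16)]
trend, seasonal, resid = decompose_additive(toy_series, period=4)
print([round(s, 3) for s in seasonal[:4]])
```

On this toy signal the moving average recovers the linear trend in the interior and the seasonal component matches the repeating pattern, leaving residuals close to zero.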

It could be an interesting experiment to predict the positive sentiment of future hidden gems descriptions, but because the sentiment series is non-stationary, this is not straightforward with simple models.


**Closing Note**: I would personally like to thank Martin for bringing quality notebooks to light every week, helping new members of this amazing community get their hard work noticed and receive feedback and reviews that help them grow in the vast field known as data science.