<a id="1.1"></a>
<h3 style="background-color:orange;font-family:newtimeroman;font-size:250%;text-align:center;border-radius: 15px 50px;">Introduction</h3>

![](https://images.unsplash.com/photo-1591025810539-a321000cda85?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1350&q=80)


Many features can make a song popular and highly rated, but which of them matter most when it comes to learning to play that song on a guitar?
What drives guitar players to take the time to learn a song they love: is it the difficulty of the song, or maybe the tuning?
In this kernel we will explore the data, try to uncover hidden patterns, and find out whether we can predict the number of 'Hits' a tab will get from the other features in our data set.

Before diving into the analysis, let's first clearly define our goals.

## Questions We Will Investigate
1) Who are the most popular artists in our data?

2) Which key has the highest ratings?

3) Which tuning has the highest rating?

4) How does the capo position affect the ratings?

5) What effect do tunings have on the average ratings?

6) Are there any words that show up in most of the song names?




<a id="1.1"></a>
<h3 style="background-color:orange;font-family:newtimeroman;font-size:200%;text-align:center;border-radius: 15px 50px;">Libraries And Utilities</h3>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as ex
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import plotly
plotly.offline.init_notebook_mode (connected = True)
import seaborn as sns
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from wordcloud import WordCloud,STOPWORDS
from sklearn.preprocessing import LabelEncoder

sns.set_style('darkgrid')

plt.rc('figure',figsize=(20,11))

In [None]:
g_data = pd.read_csv('/kaggle/input/top-850-guitar-tabs/gutiarDB.csv')
g_data.head(3)

In [None]:
#preprocessing

g_data['Song Rating'] = g_data['Song Rating'].apply(lambda x: int(''.join(x.split(','))))
g_data['Song Hits'] = g_data['Song Hits'].apply(lambda x: int(''.join(x.split(','))))
g_data.Capo = g_data.Capo.replace({' 6th fre':' 6th fret',' 2nd fre':' 2nd fret',' 5th fre':' 5th fret',
                                  ' 7th fre':' 7th fret',' 3rd fre':' 3rd fret',' 1st fre':' 1st fret',
                                  ' 4th fre':' 4th fret'})

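# Assign the result back instead of calling replace(..., inplace=True) on a column selection,
# which may operate on a temporary copy in recent pandas versions.
g_data['Difficulty'] = g_data['Difficulty'].replace({'advance':'advanced','intermediat':'intermediate','novic':'novice'})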

diff_dum = pd.get_dummies(g_data['Difficulty'],prefix='Difficulty')
diff_dum.drop(columns=['Difficulty_intermediate'],inplace=True)
g_data = pd.concat([g_data,diff_dum],axis=1)
g_data.drop(columns=['Difficulty'],inplace=True)



g_data.head(3)

<a id="1.1"></a>
<h3 style="background-color:orange;font-family:newtimeroman;font-size:250%;text-align:center;border-radius: 15px 50px;">Exploratory Data Analysis</h3>

In [None]:
ax = sns.countplot(x=g_data['Page Type'])  # pass the data by keyword; recent seaborn versions no longer accept it positionally
ax.set_title('Distribution Of Different Page Types',fontsize=18)
ax.set_xticklabels(ax.get_xticklabels(),fontsize=14)
plt.show()

In [None]:
fig=ex.scatter(x=g_data['Song Hits'],y=g_data['Song Rating'],color=g_data['Page Type'])
fig.update_layout(xaxis_title='Number Of Song Hits',yaxis_title='Song Rating Count')
fig.show()

### We can see that most of our song pages are 'chords', and far fewer are actual 'tabs'.

In [None]:
plt.subplot(2,1,1)
ax = sns.kdeplot(g_data['Song Hits'])
ax.set_xlabel("Number Of Views",fontsize=20)
ax.set_ylabel("Density",fontsize=20)
textstr = '\n'.join(
        (r'$\mu=%.2f$' % (g_data['Song Hits'].mean(),), r'$\mathrm{median}=%.2f$' % (g_data['Song Hits'].median(),),
         r'$\sigma=%.2f$' % (g_data['Song Hits'].std(),)))
props = dict(boxstyle='round', facecolor='green', alpha=0.5)
ax.text(0.75, 0.95, textstr, transform=ax.transAxes, fontsize=14,
            verticalalignment='top', bbox=props)

ax.set_title('Distribution Of View Counts Across Our Samples',fontsize=21)
plt.show()

plt.subplot(2,1,2)
g_data['Song Hits'] =np.log(g_data['Song Hits'])
ax = sns.kdeplot(g_data['Song Hits'])
ax.set_xlabel("Log Of Number Of Views",fontsize=20)
ax.set_ylabel("Density",fontsize=20)
textstr = '\n'.join(
        (r'$\mu=%.2f$' % (g_data['Song Hits'].mean(),), r'$\mathrm{median}=%.2f$' % (g_data['Song Hits'].median(),),
         r'$\sigma=%.2f$' % (g_data['Song Hits'].std(),)))
props = dict(boxstyle='round', facecolor='green', alpha=0.5)
ax.text(0.75, 0.95, textstr, transform=ax.transAxes, fontsize=14,
            verticalalignment='top', bbox=props)

ax.set_title('Distribution Of View Counts Across Our Samples (After Log Transformation)',fontsize=21)
plt.show()


### The number of views has some outliers, or 'black swans'. The distribution is positively skewed, and the curve itself is quite narrow around the mode.
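The effect of the log transformation on skewness can be checked numerically as well as visually. The snippet below is a minimal sketch on made-up view counts (the `views` series is hypothetical, not taken from the data set); it compares the sample skewness before and after applying `np.log`, the same transformation used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical view counts: mostly moderate values plus a few very large outliers.
views = pd.Series([1_200, 1_500, 1_800, 2_000, 2_500, 3_000, 4_000, 50_000, 400_000])

raw_skew = views.skew()          # sample skewness of the raw counts
log_skew = np.log(views).skew()  # sample skewness after the natural-log transform

print(f"skewness before log: {raw_skew:.2f}")
print(f"skewness after log:  {log_skew:.2f}")
```

A positive skewness value indicates a long right tail; on data like this the log transform should pull the outliers in and lower that value, although it does not necessarily remove the skew entirely.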

In [None]:
plt.subplot(2,1,1)
ax = sns.kdeplot(g_data['Song Rating'])
ax.set_xlabel("Number Of Ratings",fontsize=20)
ax.set_ylabel("Density",fontsize=20)
textstr = '\n'.join(
        (r'$\mu=%.2f$' % (g_data['Song Rating'].mean(),), r'$\mathrm{median}=%.2f$' % (g_data['Song Rating'].median(),),
         r'$\sigma=%.2f$' % (g_data['Song Rating'].std(),)))
props = dict(boxstyle='round', facecolor='green', alpha=0.5)
ax.text(0.75, 0.95, textstr, transform=ax.transAxes, fontsize=14,
            verticalalignment='top', bbox=props)

ax.set_title('Distribution Of Rating Counts Across Our Samples',fontsize=21)
plt.show()

g_data['Song Rating'] = np.log(g_data['Song Rating'])

plt.subplot(2,1,2)
ax = sns.kdeplot(g_data['Song Rating'])
ax.set_xlabel("Log Of Number Of Ratings",fontsize=20)
ax.set_ylabel("Density",fontsize=20)
textstr = '\n'.join(
        (r'$\mu=%.2f$' % (g_data['Song Rating'].mean(),), r'$\mathrm{median}=%.2f$' % (g_data['Song Rating'].median(),),
         r'$\sigma=%.2f$' % (g_data['Song Rating'].std(),)))
props = dict(boxstyle='round', facecolor='green', alpha=0.5)
ax.text(0.75, 0.95, textstr, transform=ax.transAxes, fontsize=14,
            verticalalignment='top', bbox=props)

ax.set_title('Distribution Of Rating Counts Across Our Samples (After Log Transformation)',fontsize=21)
plt.show()

### Similarly to the number of views, the number of ratings is positively skewed, with large outliers causing the skewness.

In [None]:
capo_encoder = LabelEncoder()
page_type_encoder = LabelEncoder()
key_encoder  = LabelEncoder()
tuning_encoder  = LabelEncoder()
difficulty_encoder  = LabelEncoder()

ge_data = g_data.copy()

ge_data['Capo'] =  capo_encoder.fit_transform(g_data['Capo'])
ge_data['Page Type'] =  page_type_encoder.fit_transform(g_data['Page Type'])
ge_data['Key'] =  key_encoder.fit_transform(g_data['Key'])
ge_data['Tuning'] =  tuning_encoder.fit_transform(g_data['Tuning'])

ge_data.head(3)

In [None]:
artists = ge_data.groupby(by='Artist').count()
artists = artists.sort_values(by= 'Song Name',ascending=False)
artists = artists[:30]
artists = artists.rename(columns={'Song Name':'Number Of Songs'})
ex.pie(artists,values='Number Of Songs',names=artists.index,title='Top 30 Artists')

### We see that 8.5% of the songs in our list are by Taylor Swift and Ed Sheeran.
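Note that a pie chart built from the top 30 artists shows each artist's share within those 30 only. To measure an artist's share of the whole data set, one option is `value_counts(normalize=True)`. The sketch below uses a small hypothetical `toy` frame in place of `g_data`; the artist names and counts are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the 'Artist' column of g_data.
toy = pd.DataFrame({"Artist": ["Taylor Swift", "Ed Sheeran", "Taylor Swift",
                               "Oasis", "Ed Sheeran", "Taylor Swift",
                               "Radiohead", "Coldplay", "Nirvana", "Queen"]})

# Fraction of all rows that belong to each artist, largest first.
share = toy["Artist"].value_counts(normalize=True)

# Combined share of the two most frequent artists.
top_two_share = share.head(2).sum()

print(share.head(2))
print(f"Top two artists account for {top_two_share:.1%} of the songs")
```

Applied to the real data, the same two lines on `g_data['Artist']` would give the share relative to all songs rather than relative to the top 30 slice.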

In [None]:
gez_data = ge_data.copy()
gez_data['Key'] = key_encoder.inverse_transform(ge_data['Key'])
gez_data.head(3)

In [None]:
artists = gez_data.groupby(by='Key').count()
artists = artists.sort_values(by= 'Artist',ascending=False)
artists = artists[:5]
artists = artists.rename(columns={'Song Name':'Number Of Songs'})
ex.pie(artists,values='Number Of Songs',names=artists.index,title='Top 5 Keys')


In [None]:
ex.pie(g_data,names='Capo',title='Proportions of songs with a certain capo requirement')

### Almost 50% of the songs in our data set are in the key of Db!

In [None]:
stopwords = list(STOPWORDS)

words = ''
for name in ge_data['Song Name']:
    tokens = name.lower().split(' ')
    words += ' '.join(tokens)+' '


wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='white', 
                stopwords = stopwords, 
                min_font_size = 10).generate(words) 
  
plt.figure(figsize = (25, 15), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show() 

### It seems that the most common words in our song names are positive and happy; we will also test the sentiment of the 30 most common words to make a clearer statement. We also see that 'acoustic' is the most common word, which is not surprising: many songs have an acoustic version, so the word may appear in many of our songs as part of their name.
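A word cloud shows relative frequency only through font size, so it can help to look at the actual counts. The following is a minimal sketch using `collections.Counter` from the standard library; the `titles` list and the tiny `stop` set are hypothetical stand-ins for `g_data['Song Name']` and `wordcloud.STOPWORDS`.

```python
from collections import Counter

# Hypothetical song titles; in the notebook these would come from g_data['Song Name'].
titles = ["Love Story (Acoustic)", "Perfect", "All Of Me", "Love Yourself (Acoustic)"]

# A tiny illustrative stopword set; the notebook uses wordcloud.STOPWORDS instead.
stop = {"of", "me", "all", "the", "a"}

tokens = []
for title in titles:
    for raw in title.lower().split():
        word = raw.strip("()")  # drop surrounding parentheses such as "(acoustic)"
        if word and word not in stop:
            tokens.append(word)

# Count how often each remaining word appears across all titles.
counts = Counter(tokens)
print(counts.most_common(3))
```

This tokenization is deliberately simple and will not match the word cloud's own processing exactly, so the counts should be read as a rough cross-check.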

In [None]:
sid = SentimentIntensityAnalyzer()

word_count = WordCloud().process_text(words)
word_count =  {k: v for k, v in sorted(word_count.items(), key=lambda item: item[1])}
word_list = list(word_count.items())[-30:]
word_list = [word for word,count in word_list]
sent = ' '.join(word_list)
pscores = sid.polarity_scores(sent)
dfs = pd.DataFrame(pscores,index=[1])
dfs = dfs.T
dfs= dfs.reset_index()
dfs = dfs.rename(columns={'index':'Type',1:'Value'})
dfs = dfs.drop(3)
ex.line_polar(dfs,r='Value',theta='Type',line_close=True)

### The top 30 words in our song names range between neutral and positive sentiment, with neutral being more dominant. The negative sentiment is at zero, meaning that the most common words in our song names are not negative or sad.

In [None]:
def get_pos_sentiment(sir):
    return sid.polarity_scores(sir)['pos']
def get_neg_sentiment(sir):
    return sid.polarity_scores(sir)['neg']
def get_neu_sentiment(sir):
    return sid.polarity_scores(sir)['neu']

ge_data['Positive_Sentiment'] = ge_data['Song Name'].apply(get_pos_sentiment)
ge_data['Negative_Sentiment'] = ge_data['Song Name'].apply(get_neg_sentiment)
ge_data['Neutral_Sentiment'] = ge_data['Song Name'].apply(get_neu_sentiment)
ge_data['Song_Name_Length'] = ge_data['Song Name'].apply(lambda x : len(x))

In [None]:
fig = make_subplots(rows=2, cols=1,shared_xaxes=True,subplot_titles=('Pearson Correlation',  'Spearman Correlation'))
colorscale=     [[1.0              , "rgb(165,0,38)"],
                [0.8888888888888888, "rgb(215,48,39)"],
                [0.7777777777777778, "rgb(244,109,67)"],
                [0.6666666666666666, "rgb(253,174,97)"],
                [0.5555555555555556, "rgb(254,224,144)"],
                [0.4444444444444444, "rgb(224,243,248)"],
                [0.3333333333333333, "rgb(171,217,233)"],
                [0.2222222222222222, "rgb(116,173,209)"],
                [0.1111111111111111, "rgb(69,117,180)"],
                [0.0               , "rgb(49,54,149)"]]

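# Restrict to numeric/boolean columns; recent pandas versions raise on string columns such as 'Artist'.
s_val =ge_data.select_dtypes(include=['number','bool']).corr('pearson')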
s_idx = s_val.index
s_col = s_val.columns
s_val = s_val.values
fig.add_trace(
    go.Heatmap(x=s_col,y=s_idx,z=s_val,name='pearson',showscale=False,xgap=1,ygap=1,colorscale=colorscale),
    row=1, col=1
)


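# Restrict to numeric/boolean columns; recent pandas versions raise on string columns such as 'Artist'.
s_val =ge_data.select_dtypes(include=['number','bool']).corr('spearman')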
s_idx = s_val.index
s_col = s_val.columns
s_val = s_val.values
fig.add_trace(
    go.Heatmap(x=s_col,y=s_idx,z=s_val,xgap=1,ygap=1,colorscale=colorscale),
    row=2, col=1
)

fig.update_layout(height=700, width=900, title_text="Pearson And Spearman Correlation Between Features")
fig.show()

### We see a high but unsurprising correlation between the number of views on a song tab and the number of ratings. The smaller correlations hint at some interesting behavior that needs to be tested: for example, the difficulty of a song is correlated with the song's rating, and the page type is correlated with difficulty.
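The two heatmaps can disagree because Pearson measures linear association, while Spearman measures monotonic association based on ranks. The sketch below illustrates the difference on hypothetical data (the `x` and `y` columns are made up, not taken from the data set), where `y` always increases with `x` but far from linearly.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y grows monotonically but very non-linearly with x.
x = np.arange(1, 11)
df = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson looks at linear association, Spearman at rank (monotonic) association.
pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"pearson:  {pearson:.3f}")
print(f"spearman: {spearman:.3f}")
```

When a pair of features shows a noticeably higher Spearman than Pearson value in the heatmaps, a monotonic but non-linear relationship like this one is a plausible explanation worth testing.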