<font size="5">**Basic Analysis of Top 100 Spotify Tracks of 2018**</font>

<font size="3">**Questions:** </font>
1. Are there any patterns between the given music features and tracks' ranking? 
2. Are there any notable correlations among the given music features?
3. Who appeared in the top 100 most?

**1. Loading libraries and dataset**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import os
import seaborn as sns

file_name = '../input/top2018.csv' # change this if you want to read a different dataset
df = pd.read_csv(file_name)
# Create a "ranking" column from a 1-based index
# (assumes the CSV rows are already ordered by chart position)
df.index = np.arange(1, len(df) + 1)
df = df.reset_index().rename(columns={"index": "ranking"})
df.head()

**2. Checking the dataset info for any missing values and reasons to "clean it up"**

In [None]:
#Number of rows and columns in a dataset
df.shape

In [None]:
#Number of null values in each column
df.isnull().sum()

In [None]:
#Range, column, number of non-null objects of each column, datatype and memory usage
df.info()

In [None]:
#Number of non null values in each column.
df.count()

In [None]:
#descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.
df.describe()

**3. Analyzing to answer the questions**

1. Are there any patterns between the given music features and tracks' ranking? 

> Plotting music features with respect to the ranking: (Feature) vs. Ranking

* danceability, energy, speechiness, acousticness, instrumentalness, liveness, valence: scores from 0 to 1
* loudness: decibels (dB), typically between -60 and 0
* key: pitch class, 0=C, 1=C#, 2=D, 3=D#, 4=E, 5=F, 6=F#, 7=G, 8=G#, 9=A, 10=A#, 11=B
* tempo: beats per minute (bpm)
* duration_ms: milliseconds
* time_signature: estimated number of beats per bar
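
The `key` and `duration_ms` encodings above are easier to read once they are converted to note names and minutes. The following is a minimal sketch on made-up rows, not part of the original analysis. The helper `add_readable_columns` and the column names `key_name` and `duration_min` are hypothetical and introduced only for illustration.

```python
import pandas as pd

# Pitch-class names in the order given in the list above (0=C ... 11=B)
PITCH_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def add_readable_columns(frame):
    """Return a copy with note names and durations in minutes (hypothetical helper)."""
    out = frame.copy()
    # Map the integer key code to its note name
    out["key_name"] = out["key"].map(dict(enumerate(PITCH_NAMES)))
    # 60,000 milliseconds per minute
    out["duration_min"] = out["duration_ms"] / 60000
    return out

# Made-up rows standing in for the real dataset
toy = pd.DataFrame({"key": [0, 1, 11], "duration_ms": [180000, 210000, 240000]})
readable = add_readable_columns(toy)
print(readable)
```

On the real data, the same function could be called as `add_readable_columns(df)`, assuming `df` contains the `key` and `duration_ms` columns.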

In [None]:
fig, axs = plt.subplots(6, 2, constrained_layout=True, figsize=(17, 17))

def make_a_plot(column, ax):
    """Scatter one feature against the ranking on the given axes."""
    ax.scatter(df['ranking'], df[column])
    ax.set_title(column)
    ax.set_xlabel('Ranking')
    ax.set_ylabel(column)

columns_plot = ['danceability','energy','key','loudness','speechiness','acousticness','instrumentalness','liveness','valence','tempo','duration_ms','time_signature']
# axs.T.flat walks the grid column by column, so the first six features fill the left column
for col, ax in zip(columns_plot, axs.T.flat):
    make_a_plot(col, ax)
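
Scatter plots can only be judged by eye, so it may help to put a number on each feature-versus-ranking relationship as well. Below is a minimal sketch, not part of the original analysis, that computes a Spearman rank correlation as a Pearson correlation on ranks. The function name `rank_correlation_with_ranking` and the toy values are assumptions made for illustration only.

```python
import pandas as pd

def rank_correlation_with_ranking(frame, feature_columns, ranking_column="ranking"):
    """Spearman correlation of each feature with the ranking (hypothetical helper).

    Spearman's coefficient equals Pearson's coefficient computed on ranks,
    so both sides are ranked first and then correlated.
    """
    ranked_features = frame[feature_columns].rank()
    ranked_position = frame[ranking_column].rank()
    corr = ranked_features.corrwith(ranked_position)
    # Strongest relationships first, regardless of sign
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Made-up toy data standing in for the real dataset
toy = pd.DataFrame({
    "ranking": [1, 2, 3, 4, 5, 6],
    "energy": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],            # falls steadily as ranking number grows
    "tempo": [120.0, 95.0, 140.0, 100.0, 130.0, 110.0],  # no clear trend
})
result = rank_correlation_with_ranking(toy, ["energy", "tempo"])
print(result)
```

Because ranking 1 is the top spot, a negative coefficient means that higher feature values sit closer to the top of the chart. Values near zero would be consistent with plots that show no visible pattern.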

2. Are there any notable correlations among the given music features?

> Plotting correlation matrix

In [None]:
keep_columns = ['danceability','energy','loudness','speechiness','acousticness','instrumentalness','liveness','valence','tempo','duration_ms'] # only looking at correlations between these variables
corr_matrix = df[keep_columns].corr()
corr_matrix

In [None]:
fig, axis = plt.subplots(figsize=(10, 10))
axis = sns.heatmap(
    corr_matrix, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
axis.set_xticklabels(
    axis.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
)
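
A heatmap is good for an overview, but the strongest pairs can also be listed directly from the correlation matrix. The sketch below, written under the assumption that a sorted list is wanted, reads each pair once from the upper triangle. The helper `strongest_pairs` and the toy numbers are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

def strongest_pairs(corr, top_n=5):
    """List feature pairs by absolute correlation, each pair once (hypothetical helper)."""
    # Upper-triangle positions only; k=1 skips the diagonal of self-correlations
    rows, cols = np.triu_indices(len(corr), k=1)
    pairs = pd.Series(
        corr.to_numpy()[rows, cols],
        index=pd.MultiIndex.from_arrays([corr.index[rows], corr.columns[cols]]),
    )
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(top_n)

# Made-up toy data standing in for the real feature columns
toy = pd.DataFrame({
    "energy": [0.2, 0.4, 0.6, 0.8, 1.0],
    "loudness": [-10.0, -8.0, -6.0, -4.0, -2.0],
    "acousticness": [0.9, 0.7, 0.6, 0.2, 0.1],
})
top = strongest_pairs(toy.corr())
print(top)
```

With the notebook's own matrix, the call would be `strongest_pairs(corr_matrix)`.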
    

3. Who appeared in the top 100 most?

> Plotting a horizontal bar graph of artists and the number of tracks each has in the top 100 list

In [None]:
# Count tracks per artist and keep the 19 artists with the most entries
top_artists = df.groupby('artists').id.count().sort_values(ascending=False).iloc[:19]
fig4, ax = plt.subplots(figsize=(10, 10))
top_artists.plot.barh(ax=ax)  # draw on the axes created above
ax.set_title('Artists With Most Songs in Top 100')
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Number of Appearances')
plt.show()

**4. Summary**

1. Are there any patterns between the given music features and tracks' ranking? 
> The scatter plots show no apparent pattern between any single music feature and a track's ranking. This suggests that these features alone are unlikely to predict ranking. Other, less technical factors probably matter more.

2. Are there any notable correlations among the given music features?
> Important thing to keep in mind: correlation vs. causation. A correlation between two features does not necessarily mean that one causes the other.
>
> Positive correlations:
> * loudness vs energy -> the louder, the more energetic
> * danceability, loudness, energy vs valence -> the more positive a track, the more danceable, louder and more energetic it tends to be
> * speechiness vs danceability -> the more spoken words, the more danceable
> 
> Negative correlations:
> * acousticness vs energy -> the less acoustic, the more energetic
> * acousticness, speechiness vs loudness -> the less acoustic and the fewer spoken words, the louder
> * tempo vs danceability -> the higher the tempo, the less danceable
>
> These are the most apparent ones. Personally, I think "loudness vs energy", "danceability, loudness, energy vs valence" and "acousticness vs energy" are the most intuitive correlations.

3. Who appeared in the top 100 most?
> Not surprisingly, the top three artists with most songs in the list are:
> 1. XXXTENTACION: 6 tracks
> 2. Post Malone: 6 tracks
> 3. Drake: 4 tracks

**5. What could have been done?**

> I think a more insightful analysis would have been possible with additional data on:
> * the number of streams
> * genre
> * some metric or score of artists' fame
> * tags associated with each song
>
> With these details, it might even be possible to identify which factors predict ranking.