# Table of Contents

1. [Introduction to the dataset](#intro)
2. [Shape](#shape)
3. [Dealing with missing values](#nan)
4. [Changing the index](#index)
5. [Drop unnecessary columns](#drop)
6. [Artists with most songs on the list](#artists)
7. [Decade with most songs](#year)
8. [Writers with most songs on the list](#writers)
9. [Exploring the month column](#month)
10. [Wordcloud](#cloud)


### 1. Introduction to the dataset

<a id='intro'></a>

*The 500 Greatest Songs of All Time* was the cover story of a special issue of **Rolling Stone** magazine, issue number 963, published in December 2004.

Sources: 

[500 greatest songs - original article](https://www.rollingstone.com/music/music-lists/500-greatest-songs-of-all-time-151127/)

[500 greatest songs - wiki](https://en.wikipedia.org/wiki/Rolling_Stone%27s_500_Greatest_Songs_of_All_Time)


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
df = pd.read_csv('../input/500-greatest-songs-of-all-time/Top 500 Songs.csv', encoding='ISO-8859-2')
df.head()

### 2. Shape

<a id='shape'></a>

We know that this dataset is supposed to contain information about 500 songs, which should correspond to 500 rows. We check the shape of the df to make sure it is what we expect.
We also display all of the column names.

In [None]:
print(df.shape)
df.columns

### 3. Dealing with missing values

<a id='nan'></a>

In [None]:
df.isnull().sum()
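
The counts above can also be read as a share of all rows, which makes it easier to judge how much of each column is missing. Below is a minimal sketch on a small made-up frame (the `toy` frame and its values are illustrative assumptions, not rows from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up)
toy = pd.DataFrame({
    'title': ['Song A', 'Song B', 'Song C', 'Song D', 'Song E'],
    'streak': ['12 weeks', np.nan, '7 weeks', '3 weeks', '20 weeks'],
    'position': [1, np.nan, 5, 2, 10],
})

# isnull() gives True/False per cell; the column mean of those booleans
# is the fraction of missing values, scaled here to a percentage
missing_share = toy.isnull().mean() * 100
print(missing_share.round(1))
```

Running `df.isnull().mean() * 100` on the real df would give the same kind of per-column percentages.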

Surprisingly, there are quite a lot of missing values in the dataset. The check shows that about 20% of the songs have no chart position recorded. We'll look at the rows with null values to see whether we can find a reason for it.

In [None]:
df[df.isnull().any(axis=1)]

In [None]:
df.streak.unique()

Looking at the unique values of the "streak" column reveals some of the reasons why about 20% of the songs are missing chart positions:
- The song did not chart
- The song is not a single
- The song predates the charts

If we want to get statistical information from the streak column, we need to clean it up first.
We strip the non-digit text (such as " week") from the strings, replace missing values with 0, and finally convert the column to a numeric dtype.

In [None]:
# Removing every non-digit character from the strings in the Streak column.
# Assigning the result back avoids relying on an in-place change to a column selection.
df['streak'] = df['streak'].str.replace(r'\D', '', regex=True)

df.streak.unique()

In [None]:
# Converting the Streak column to numbers: empty strings and anything else
# that cannot be parsed become NaN, and missing values are then replaced with 0
df['streak'] = pd.to_numeric(df['streak'], errors='coerce').fillna(0).astype(int)
df.streak.unique()
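
To make the effect of the clean-up steps easier to see, here is a small self-contained sketch that applies them to a made-up Series. The sample strings are assumptions that only imitate the kinds of entries described above, not actual values from the dataset:

```python
import numpy as np
import pandas as pd

# Made-up values imitating the kinds of entries seen in the streak column
streak = pd.Series(['12 weeks', '1 week', 'Did not chart', np.nan, '23 weeks'])

# 1. Strip every non-digit character (missing values stay missing)
digits = streak.str.replace(r'\D', '', regex=True)

# 2. Text-only entries are now empty strings; coerce them to NaN,
#    fill all missing values with 0 and convert to integers
cleaned = pd.to_numeric(digits, errors='coerce').fillna(0).astype(int)

print(cleaned.tolist())
```

Note that filling with 0 treats "did not chart" and "unknown" the same way, which is worth keeping in mind when reading summary statistics of this column.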

### 4. Changing the index

<a id='index'></a>


We assume for the sake of analysis that the list is ordered and that the first song on the list is actually the "greatest song of all time" according to the magazine, i.e. it is the number 1 song.
Therefore we change the index of the df so that it starts at 1 instead of 0.
We also add a column that shows the position of each song in the list of 500 greatest songs.

In [None]:
# Changing the index
df.index = df.index + 1

# Adding a new column for the rank
df['rank'] = df.index
df.head()

### 5. Drop unnecessary columns

<a id='drop'></a>

In [None]:
# Dropping some columns that I am not going to use.
df.drop(['description', 'appears on', 'position'], axis=1, inplace=True)
df.head()

### 6. Artists with most songs on the list

<a id='artists'></a>

Now we'll see which artists have the most songs on the list.
We only look at artists that have more than 3 songs.

In [None]:
number_of_songs = df.artist.value_counts() 
print(number_of_songs[number_of_songs > 3].count())

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
sns.countplot(y='artist', data=df, order=df['artist'].value_counts().iloc[:33].index)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)


plt.xlabel("Song Count")
plt.ylabel("")
plt.title("Artists That Have More Than 3 Songs On The List", fontsize=18)
plt.show()
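
The plot above slices the first 33 artists, presumably the count printed in the previous cell. As a hedged alternative, the plotting order can be derived directly from the "more than 3 songs" filter, so it stays correct if the data changes. A minimal sketch on a made-up artist column (the names `artists` and `top_artists` are illustrative):

```python
import pandas as pd

# Made-up artist column standing in for df['artist']
artists = pd.Series(['A'] * 5 + ['B'] * 4 + ['C'] * 3 + ['D'] * 1, name='artist')

counts = artists.value_counts()

# Keep only artists above the threshold; value_counts() already sorts
# by count in descending order, so the index is ready to use for plotting
top_artists = counts[counts > 3].index
print(list(top_artists))
```

On the real df, an index built this way could be passed to `sns.countplot` as `order=top_artists` instead of `value_counts().iloc[:33].index`.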

### 7. Decade with most songs

<a id='year'></a>

In which decade were most of the best songs released? To answer that question we need a separate column for the year.

In [None]:
df[['month', 'year']] = df['released'].str.split(", ", expand=True)
df.head()

In [None]:
df.info()

In [None]:
# Converting Year column into integer
df['year'] = pd.to_numeric(df['year'])
df.head()

In [None]:
songs_per_year = df.year.value_counts()
songs_per_year

Woah! Clearly the 60s revolutionized the music industry, according to Rolling Stone!
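
Since the question is about decades rather than single years, the yearly counts can also be grouped by decade. Below is a minimal sketch on made-up years (the values are illustrative, not taken from the dataset); integer division by 10 drops the last digit of the year:

```python
import pandas as pd

# Made-up release years standing in for df['year']
years = pd.Series([1965, 1967, 1971, 1964, 1991, 1958, 1969])

# 1967 // 10 * 10 -> 1960, so every year maps to the start of its decade
decades = (years // 10) * 10

# Count songs per decade and sort chronologically
songs_per_decade = decades.value_counts().sort_index()
print(songs_per_decade)
```

Applied to the real column, `(df['year'] // 10) * 10` would give a decade label for every song.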

In [None]:
fig, ax = plt.subplots(1, figsize=(16, 8))
# histplot replaces the deprecated distplot (seaborn 0.11+)
sns.histplot(df['year'], color='g', ax=ax)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)

plt.ylabel('Number of Songs')
plt.title('Number of Songs by Release Year', fontsize=18)
plt.show()

### 8. Writers with most songs on the list

<a id='writers'></a>

Exploring those with more than 3 songs on the list.

In [None]:
songs_per_writer = df.writers.value_counts()
print(songs_per_writer[songs_per_writer>3])

### 9. Exploring the month column

<a id='month'></a>

In [None]:
df.month.unique()

We can see that months are written down in many different ways. We change them into the standard month names.

*p.s. There probably is a better way to do that, any advice is appreciated.*

In [None]:
# Full month names; the first three letters of each are used for matching
month_names = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

def standardize_month(s):
    # Leave missing or non-string values unchanged
    if not isinstance(s, str):
        return s
    # Return the first month whose abbreviation appears in the raw string
    for name in month_names:
        if name[:3] in s:
            return name
    return s

df['month'] = df['month'].apply(standardize_month)

In [None]:
df.month.unique()

The release month looks essentially random: no month stands out as a clear 'loser'.

In [None]:
#songs_per_month = df.month.value_counts()
#songs_per_month

fig, ax = plt.subplots(1, figsize=(16, 8))
sns.countplot(y='month', data=df, color='g', alpha=0.2)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)

plt.ylabel(' ')
plt.xlabel('Song Count')

plt.title('Number of Songs by Release Month', fontsize=18)
plt.show()

### 10. Wordcloud

<a id='cloud'></a>

I got the idea of making a wordcloud of the most used words in the song titles from Marilia Prata: https://www.kaggle.com/mpwolke/500-greatest-songs. It is a neat way to see what the greatest songs of all time are about.

In [None]:
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

text = " ".join(str(song) for song in df.title)

# Create a mask to use as a shape for the wordcloud
mask = np.array(Image.open('../input/ovalshape/oval_shape.png'))

# Final version with updated stopwords and shape
stopwords = set(STOPWORDS)
stopwords.update(['Dont', 'Da', 'B', 'Aint', 'Got', 'Lotta', 'O', 'Im', 'Bo', 'Ya'])
wordcloud = WordCloud(max_font_size=130, max_words=200, background_color="white", stopwords=stopwords, 
                      mask=mask, colormap='Paired', width=mask.shape[1], height=mask.shape[0]).generate(text)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()