## Introduction ##
Rolling Stone's 500 Greatest Albums of All Time is a comprehensive list of some of the most influential albums, spanning over seven decades. This is the ultimate who's who of music . . .

## First Look at the Data ##
Let's load up the data and take a look at what we are working with here.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("../input/albumlist.csv", encoding = 'latin1') # I've added that 'encoding' bit just to fix the unicode decoding error in python 3
df.loc[0:20]

## A little housekeeping ##
At first sight, the genre field seems to have too much in it: a whole host of commas and slashes. Let's tidy that up a bit.
I'm going to restrict each album to a single genre, namely the first one listed. I know that approach may not seem musically open-minded enough, but I'm going to stick with it anyway.
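
To make the idea concrete, here is a minimal sketch on a made-up Series (the three genre strings are illustrative, not rows copied from the dataset). It keeps only the text before the first comma, using pandas' string accessor instead of a loop.

```python
import pandas as pd

# Hypothetical genre entries, just to show the shape of the problem
genres = pd.Series(["Rock", "Funk / Soul, Pop", "Rock, Blues"])

# Split at the comma, keep the first piece, and trim stray whitespace
first_genre = genres.str.split(",").str[0].str.strip()
print(first_genre.tolist())
```

Note that slashes are left alone here, so an entry like "Funk / Soul" stays as a single genre.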

In [None]:
# Cleaning up genres
split_genre = []
for s in  df["Genre"]:
    split_genre.append(s.split(",")[0]) # Split every genre field entry at the comma
df["Genre"] = split_genre               # and only use the first genre specified
df.loc[0:20]

Alright. Much Better.
Also, it would be useful to add a column that holds the decade of the album release. I will be using this later.

In [None]:
# Adding decades column
newyears = []
for year in df["Year"]:
    if year < 1960:
        newyears.append("50s")
    elif year < 1970:
        newyears.append("60s")
    elif year < 1980:
        newyears.append("70s")
    elif year < 1990:
        newyears.append("80s")
    elif year < 2000:
        newyears.append("90s")
    elif year < 2010:
        newyears.append("00s")
    else:
        newyears.append("10s")
df["Decade"] = newyears
sorter = ["50s", "60s", "70s", "80s", "90s", "00s", "10s"]
df["Decade"] = pd.Categorical(df["Decade"], sorter)
df = df.sort_values("Decade")
df.head()
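
As an aside, the same binning can probably be written more compactly with `pd.cut`. This is only a sketch on a handful of made-up years; the bin edges (including the 9999 upper bound) are my own assumption, chosen to match the if/elif chain above.

```python
import pandas as pd

# Hypothetical release years, including the 1959/1960 boundary
years = pd.Series([1955, 1959, 1960, 1979, 1991, 2005, 2012])

# Each bin is (left, right], so 1959 lands in "50s" and 1960 in "60s"
bins = [0, 1959, 1969, 1979, 1989, 1999, 2009, 9999]
labels = ["50s", "60s", "70s", "80s", "90s", "00s", "10s"]

decades = pd.cut(years, bins=bins, labels=labels)
print(decades.tolist())
```

The result is already an ordered categorical, so sorting by it should follow the chronological order of the labels without a separate "sorter" list.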

## 1. Top 20 Artists with the most Entries ##
To start off, let's see which artists are the most represented in this list. First, let's group the data by artist name using the pandas "groupby" function.

In [None]:
group_artist = df.groupby("Artist")
ser = group_artist.size()

Here, the "size" function returns the number of entries in every group, in this case, for each artist.
Next, we need to convert this series into a dataframe and sort it in descending order of number of entries. After that, we simply print the top 20.
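
Here is a small sketch of those two steps on toy data (the single-letter artist names are placeholders I made up). It also shows `value_counts`, which I believe gives the same counts in one call.

```python
import pandas as pd

# Toy stand-in for the album list; "A", "B", "C" are placeholder artists
toy = pd.DataFrame({"Artist": ["A", "B", "A", "C", "A", "B"]})

# Step 1: number of rows per artist, as a Series indexed by artist
counts = toy.groupby("Artist").size()

# Step 2: turn the Series into a dataframe and sort by count, largest first
table = counts.reset_index(name="Count").sort_values("Count", ascending=False)
print(table)

# A shortcut that counts and sorts in one go
shortcut = toy["Artist"].value_counts()
```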

In [None]:
byartist = pd.DataFrame({"Artist":ser.index, "Count":ser.values})
topartists = byartist.sort_values("Count", ascending = False).iloc[0:20]
topartists.index = [x for x in range(1, 21)] # Reset the numbering to start from 1
topartists

No surprises there. The Beatles leading the pack.

## 2. Most featured Genres ##
In a similar way, let's now see which genre has the most entries in the list.

In [None]:
genre_series = df.groupby("Genre").size()
bygenre = pd.DataFrame({"Genre":genre_series.index, "No. of Entries":genre_series.values})
topgenres = bygenre.sort_values("No. of Entries", ascending = False)
topgenres.index = [x for x in range(1, len(topgenres) + 1)] # Number from 1, however many genres there are
topgenres

No prizes for guessing what the most popular genre in the list is. \m/

## 3. Most popular genre in each decade ##
Let's take it up a notch. Now, we'll try to find out which genre is most represented in each decade using a barplot, because we can. For this, I'll use the "Seaborn" graphing package. But first, we must group the dataframe by decade and genre.

In [None]:
ser = df.groupby(["Decade", "Genre"]).size()
years = []
genres = []
for x in ser.index:
    years.append(x[0])
    genres.append(x[1])
byyear = pd.DataFrame({"Decade":years, "Genre":genres, "Count":ser.values})
# Order Decade chronologically
sorter = ["50s", "60s", "70s", "80s", "90s", "00s", "10s"]
byyear["Decade"] = pd.Categorical(byyear["Decade"], sorter)
byyear = byyear.sort_values("Decade")

Calling "size" on the grouped data returns a pandas Series whose index is made up of the columns we grouped by. Here we group by both decade and genre, so the series "ser" has a multi-level index, with "Decade" as the outer level and "Genre" as the inner level. When creating the dataframe "byyear", we must pull the two levels out of the index and pass them as separate columns; otherwise, only the counts would end up in the dataframe.
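
To see the multi-level index in isolation, here is a minimal sketch on toy data (the rows are invented for illustration). It also shows `reset_index(name=...)`, which should flatten both index levels into ordinary columns without the explicit loop.

```python
import pandas as pd

# Toy rows standing in for the album list
toy = pd.DataFrame({
    "Decade": ["60s", "60s", "70s", "70s", "70s"],
    "Genre": ["Rock", "Jazz", "Rock", "Rock", "Funk / Soul"],
})

ser = toy.groupby(["Decade", "Genre"]).size()
print(list(ser.index.names))  # the two index levels

# Move both index levels into columns and name the counts column
flat = ser.reset_index(name="Count")
print(flat)
```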

Now let's get on with the plotting. Seaborn makes it quite easy to draw grouped barplots, but it offers few text formatting options. Luckily, matplotlib calls work on top of a seaborn plot. So, we will use seaborn to plot the graph, and matplotlib to add the title, axis labels and the legend.

In [None]:
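# Note: sns.factorplot was renamed to sns.catplot (and "size" to "height") in seaborn 0.9.
# It is a figure-level function that creates its own figure, so no plt.figure() call is needed.
sns.set_style("whitegrid")
ax = sns.catplot(x="Decade", y="Count", hue = "Genre", aspect = 2,
                 data=byyear, kind = "bar", height = 10,
                 palette = sns.color_palette("hls", byyear["Genre"].nunique()), # one colour per genre
                 legend = False)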
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), prop={'size':15})
plt.title("Most Popular Genres by Decade", fontsize = 20)
plt.xlabel("Decade", fontsize = 17)
plt.ylabel("Number of Entries", fontsize = 17)

## 4. Most common sub-genres among Rock albums ##
Since Rock albums hold the lion's share of the entries, let's try to dissect these a bit more.

Let's get the most common sub-genres among rock albums.

But before that, we must clean up the sub-genre field a bit. Each album has around 3 or 4 sub-genres. So, I'm going to write a function that splits each sub-genre entry at the commas and appends the pieces to a single list.

In [None]:
rock = df[df.Genre == "Rock"]
# Now to split each subgenre into separate strings and add them to a list
def split_sub(sub):
    res = []
    for x in sub:
        spl = x.split(", ") # To separate the sub-genres using comma as the delimiter
        for s in spl:       # into 'spl', which is a list
            res.append(s)
    return res

subgenres = split_sub(list(rock["Subgenre"]))
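
The same flattening can also be sketched with pandas' own string methods, assuming a pandas version that has `Series.explode` (0.25 or later, as far as I know). The sub-genre strings below are made up for illustration.

```python
import pandas as pd

# Hypothetical sub-genre entries, comma-separated like the real column
toy = pd.Series(["Rock & Roll, Psychedelic Rock", "Pop Rock", "Pop Rock, Blues Rock"])

# Split each entry into a list, then give every list item its own row
flat = toy.str.split(", ").explode().tolist()
print(flat)
```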

With that done, we need to count the number of entries for each sub-genre in the list "subgenres". For this, we use the "Counter" class from the "collections" module.
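
As a quick illustration of the counting step, here is a sketch on a short made-up list. `most_common` returns (item, count) tuples, highest count first.

```python
from collections import Counter

# Hypothetical flattened list of sub-genres
toy_subgenres = ["Pop Rock", "Blues Rock", "Pop Rock", "Folk Rock", "Pop Rock", "Blues Rock"]

toy_count = Counter(toy_subgenres)
top_two = toy_count.most_common(2)
print(top_two)  # [('Pop Rock', 3), ('Blues Rock', 2)]
```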

In [None]:
import collections as co

count = co.Counter(subgenres)
topsub_list = count.most_common() # 'topsub_list' is now a list of tuples
topsub = pd.DataFrame(topsub_list, columns = ["Subgenre", "Number of entries"])
topsub.index = [x for x in range(1, len(topsub) + 1)]
topsub.head(20) # The index now starts at 1, so take the first 20 rows by position

Pop Rock? 

## 5. Most featured artist in each decade ##
Lastly, let's find out the top 5 artists in each decade, according to this list.

In [None]:
art_ser = df.groupby(["Decade", "Artist"]).size()
temp = art_ser.to_frame()
temp = temp.rename(columns = {0:'No. of Entries'})
art_dec = temp['No. of Entries'].groupby(level = 0, group_keys = False)
art_dec.nlargest(5)
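
To show what the per-decade "nlargest" does, here is a minimal sketch on toy data (placeholder artists, top 2 instead of top 5). It counts entries per decade and artist, then keeps the largest counts within each decade.

```python
import pandas as pd

# Toy stand-in: three artists in the 60s, three in the 70s
toy = pd.DataFrame({
    "Decade": ["60s"] * 5 + ["70s"] * 4,
    "Artist": ["A", "A", "A", "B", "C", "D", "D", "E", "F"],
})

counts = toy.groupby(["Decade", "Artist"]).size()

# Group again by the decade level and keep the 2 biggest counts per decade
top2 = counts.groupby(level="Decade", group_keys=False).nlargest(2)
print(top2)
```

Ties (artists with the same count) are broken by order of appearance, so the cut-off at 5 in the real data may be somewhat arbitrary for artists with equal counts.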

## Conclusion ##
So those are some preliminary analyses that I thought would prove useful. I would love to hear what you think in the comments section. Leave an upvote if you like it!