## Part A

A list of user IDs, along with the number of distinct songs each user has played.

Let's check the versions of Python and pandas.

In [3]:
import platform

In [4]:
print('Python version')
platform.python_version()

Python version


'3.6.1'

In [5]:
import pandas as pd

In [6]:
print('Pandas version')
pd.__version__

Pandas version


'0.20.1'

### Reading in the file

While exploring the file, I saw that it has no header row, so I supply the column names in the read command below.

The file is expected to contain around 12 million rows. A line count on UNIX confirmed this. (Opening the file on Windows with Notepad++ or Sublime Text got me nowhere; the file was too large for them to handle.)

When I read the file into pandas, only around 800k rows were loaded. I have not established the cause: it may be a limitation in how pandas handled this large file, or my laptop may not have enough memory.

I will go ahead and tackle the exercises with the 800k rows, since they are a subset of the 12 million. The logic applied to the 800k rows is the same as the logic that would be applied to the full file. I hope this is alright.

As we will see below, the loaded dataset covers 33 users, whereas the original file has around 1000 users.
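If memory turns out to be the constraint (an assumption; the cause of the short read is not established here), the same per-user count can be built up chunk by chunk instead of loading everything at once. The sketch below uses a small in-memory stand-in for the TSV; the rows and the `chunksize=2` value are hypothetical and chosen only to keep the example short.

```python
import io
from collections import defaultdict

import pandas as pd

# Hypothetical stand-in for the real file: same six tab-separated columns.
raw = (
    'user_000001\t2009-05-04T23:08:57Z\ta1\tArtist A\tt1\tSong A\n'
    'user_000001\t2009-05-04T13:54:10Z\ta1\tArtist A\tt2\tSong B\n'
    'user_000001\t2009-05-04T13:52:04Z\ta1\tArtist A\tt1\tSong A\n'
    'user_000002\t2009-05-04T13:42:52Z\ta1\tArtist A\tt1\tSong A\n'
)
cols = ['userid', 'timestamp', 'artid', 'artname', 'traid', 'traname']

# One set of track names per user, filled in as the chunks arrive.
songs_by_user = defaultdict(set)

reader = pd.read_csv(io.StringIO(raw), sep='\t', names=cols, header=None,
                     chunksize=2)
for chunk in reader:
    # Iterating a grouped column yields (userid, Series of track names).
    for userid, names in chunk.groupby('userid')['traname']:
        songs_by_user[userid].update(names.dropna())

# Distinct songs per user across all chunks.
chunked_counts = {user: len(names) for user, names in songs_by_user.items()}
print(chunked_counts)
```

Only one chunk is held in memory at a time, at the cost of keeping a set of track names per user.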

In [7]:
df = pd.read_table('Files/userid-timestamp-artid-artname-traid-traname.tsv',
                   names=['userid', 'timestamp', 'artid', 'artname', 'traid', 'traname'],
                   header=None)
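One thing worth ruling out when a delimited file loads with fewer rows than expected is quote handling. By default the parser treats a double quote at the start of a field as the opening of a quoted field, so a stray quote in a track name can swallow the following lines into one row. Whether that is what happened with this file is an assumption I have not verified. The sketch below reproduces the effect on hypothetical toy rows and compares it with `quoting=csv.QUOTE_NONE`.

```python
import csv
import io

import pandas as pd

# Hypothetical rows: the first and third track names contain a stray quote.
raw = (
    'user_000001\t2009-05-04T23:08:57Z\ta1\tArtist A\tt1\t"Intro\n'
    'user_000001\t2009-05-04T13:54:10Z\ta2\tArtist B\tt2\tSecond Song\n'
    'user_000002\t2009-05-04T13:52:04Z\ta3\tArtist C\tt3\tThird "Song\n'
    'user_000002\t2009-05-04T13:42:52Z\ta4\tArtist D\tt4\tFourth Song\n'
)
cols = ['userid', 'timestamp', 'artid', 'artname', 'traid', 'traname']

# Default quoting: text between the two quote characters becomes one field,
# so several physical lines collapse into a single row.
df_default = pd.read_csv(io.StringIO(raw), sep='\t', names=cols, header=None)

# QUOTE_NONE treats the quote character as ordinary text.
df_plain = pd.read_csv(io.StringIO(raw), sep='\t', names=cols, header=None,
                       quoting=csv.QUOTE_NONE)

print(len(df_default), len(df_plain))
```

If the two row counts differ on the real file as well, quoting is a likely culprit; if they match, the cause lies elsewhere.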

Let's take a look at the first few lines.

The column names are displayed as expected. So far so good.

In [8]:
df.head()

| | userid | timestamp | artid | artname | traid | traname |
|---|---|---|---|---|---|---|
| 0 | user_000001 | 2009-05-04T23:08:57Z | f1b1cf71-bd35-4e99-8624-24a6e15f133a | Deep Dish | | Fuck Me Im Famous (Pacha Ibiza)-09-28-2007 |
| 1 | user_000001 | 2009-05-04T13:54:10Z | a7f7df4a-77d8-4f12-8acd-5c60c93f4de8 | 坂本龍一 | | Composition 0919 (Live_2009_4_15) |
| 2 | user_000001 | 2009-05-04T13:52:04Z | a7f7df4a-77d8-4f12-8acd-5c60c93f4de8 | 坂本龍一 | | Mc2 (Live_2009_4_15) |
| 3 | user_000001 | 2009-05-04T13:42:52Z | a7f7df4a-77d8-4f12-8acd-5c60c93f4de8 | 坂本龍一 | | Hibari (Live_2009_4_15) |
| 4 | user_000001 | 2009-05-04T13:42:11Z | a7f7df4a-77d8-4f12-8acd-5c60c93f4de8 | 坂本龍一 | | Mc1 (Live_2009_4_15) |


Let's take a look at the last few lines.

The last few lines show rows for user 33, and as we can see, the row count is around 800k. 

In [9]:
df.tail()

| | userid | timestamp | artid | artname | traid | traname |
|---|---|---|---|---|---|---|
| 835868 | user_000033 | 2007-05-24T20:07:13Z | b5fb0f1a-2cd4-4759-9d80-648643f70c43 | シートベルツ | 6bd67df3-5acc-4325-a0f2-4fddf4b0b7d8 | No Reply |
| 835869 | user_000033 | 2007-05-24T20:04:53Z | 4d49a36d-76a4-48b1-b1ae-9a94bc980345 | Ferenc Snétberger | | Kelas(Let'S-Dance)[Instrumental Version] |
| 835870 | user_000033 | 2007-05-24T20:01:50Z | 4d49a36d-76a4-48b1-b1ae-9a94bc980345 | Ferenc Snétberger | | Kelas(Let'S-Dance) |
| 835871 | user_000033 | 2007-05-24T19:56:32Z | 34cf95c7-4be9-4efd-a48a-c2ea4a0bb114 | America | | That'S All I'Ve Got To Say |
| 835872 | user_000033 | 2007-05-24T19:50:25Z | | ~8+ | | Ťč |


The command below gives an overall view of the dataset.

This is useful to see the row count and to get an idea of null values in the dataset.

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 835873 entries, 0 to 835872
Data columns (total 6 columns):
userid       835873 non-null object
timestamp    835873 non-null object
artid        810829 non-null object
artname      835873 non-null object
traid        733552 non-null object
traname      835873 non-null object
dtypes: object(6)
memory usage: 38.3+ MB


The command below describes the dataset further.

As we can see, there are 33 unique users. The artist ID column ('artid') has fewer non-null values than the artist name column ('artname'), which means some artists do not have an ID. Similarly, some tracks do not have an ID ('traid' has fewer non-null values than 'traname').

In [11]:
df.describe()

| | userid | timestamp | artid | artname | traid | traname |
|---|---|---|---|---|---|---|
| count | 835873 | 835873 | 810829 | 835873 | 733552 | 835873 |
| unique | 33 | 830774 | 16947 | 20190 | 111829 | 116682 |
| top | user_000012 | 2009-02-11T12:14:38Z | 164f0d73-1234-4e2c-8743-d77bf2191051 | Kanye West | 82558949-cd98-4c58-af35-3f1a9430d52e | Heartless |
| freq | 75876 | 33 | 27115 | 27115 | 2069 | 2121 |
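The gaps can also be counted directly rather than read off by subtraction. From the counts above, 835873 - 810829 = 25044 rows have no artist ID and 835873 - 733552 = 102321 rows have no track ID. A minimal sketch of the same check on a hypothetical toy frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same ID/name columns as the dataset.
toy = pd.DataFrame({
    'artid': ['a1', np.nan, 'a3'],
    'artname': ['Artist A', 'Artist B', 'Artist C'],
    'traid': [np.nan, np.nan, 't3'],
    'traname': ['Song 1', 'Song 2', 'Song 3'],
})

# Number of missing values per column.
missing = toy.isnull().sum()

# Share of missing values per column, as a percentage.
percent = (toy.isnull().mean() * 100).round(1)

summary = pd.DataFrame({'missing': missing, 'percent': percent})
print(summary)
```

Running `df.isnull().sum()` on the loaded data should reproduce the two differences computed above.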


### Number of distinct songs played

Let's define a function that counts the distinct songs played by each userid.

I want to count by track name and by track ID. The two give different results because some tracks do not have IDs.

In [12]:
def groupDistinctSongs(df, col):
    """Count the distinct non-null values of `col` for each user."""
    # nunique() ignores NaN, so rows with a missing value are not counted.
    grouped = df.groupby('userid')[col].nunique()
    df_grouped = grouped.reset_index(name='Songs Played')
    return df_grouped.rename(columns={'userid': 'User ID'})
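To see why the two columns can give different counts, here is a small sketch on hypothetical toy data. By default `nunique()` skips missing values, so a track without an ID adds to the name count but not to the ID count; passing `dropna=False` counts NaN as one extra value.

```python
import numpy as np
import pandas as pd

# Hypothetical plays: each user has one track with no ID.
toy = pd.DataFrame({
    'userid': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'traid': ['t1', 't1', np.nan, 't2', np.nan],
    'traname': ['Song 1', 'Song 1', 'Song 2', 'Song 3', 'Song 4'],
})

# Distinct track names per user.
by_name = toy.groupby('userid')['traname'].nunique()

# Distinct track IDs per user; NaN is ignored by default.
by_id = toy.groupby('userid')['traid'].nunique()

# Same count, but treating NaN as a value of its own.
by_id_with_nan = toy.groupby('userid')['traid'].nunique(dropna=False)

print(pd.DataFrame({'by_name': by_name, 'by_id': by_id,
                    'by_id_with_nan': by_id_with_nan}))
```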

### by Track Name

Here I count by track name.

user_000001 played 3092 distinct songs, user_000002 played 8129, and so on.

In [13]:
groupDistinctSongs(df, 'traname')

| | User ID | Songs Played |
|---|---|---|
| 0 | user_000001 | 3092 |
| 1 | user_000002 | 8129 |
| 2 | user_000003 | 4565 |
| 3 | user_000004 | 5974 |
| 4 | user_000005 | 1974 |
| 5 | user_000006 | 7733 |
| 6 | user_000007 | 1093 |
| 7 | user_000008 | 608 |
| 8 | user_000009 | 2555 |
| 9 | user_000010 | 874 |


### by Track ID

Counting by track ID, the numbers are smaller, as expected, because tracks without an ID are not counted. For some users the gap is large: user_000006 drops from 7733 to 4958.

I did this as an extra step.

In [14]:
groupDistinctSongs(df, 'traid')

| | User ID | Songs Played |
|---|---|---|
| 0 | user_000001 | 2325 |
| 1 | user_000002 | 7701 |
| 2 | user_000003 | 4075 |
| 3 | user_000004 | 5409 |
| 4 | user_000005 | 1378 |
| 5 | user_000006 | 4958 |
| 6 | user_000007 | 644 |
| 7 | user_000008 | 416 |
| 8 | user_000009 | 2333 |
| 9 | user_000010 | 603 |
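Neither count is a perfect definition of a "song": counting by ID drops tracks without an ID, while counting by name alone merges tracks that share a title across artists. One possible middle ground, offered here as a sketch rather than as part of the exercise, is to count distinct (artist name, track name) pairs per user. The toy rows below are hypothetical.

```python
import pandas as pd

# Hypothetical plays: two artists share the title 'Intro'.
toy = pd.DataFrame({
    'userid': ['u1', 'u1', 'u1', 'u2'],
    'artname': ['Artist A', 'Artist B', 'Artist A', 'Artist A'],
    'traname': ['Intro', 'Intro', 'Intro', 'Intro'],
})

# Keep one row per (user, artist, title) combination.
pairs = toy.drop_duplicates(['userid', 'artname', 'traname'])

# Count the remaining rows per user.
songs_per_user = (pairs.groupby('userid').size()
                  .reset_index(name='Songs Played'))

# For comparison: counting by title alone merges the two 'Intro' tracks.
by_title_only = toy.groupby('userid')['traname'].nunique()

print(songs_per_user)
```

This works here because 'artname' and 'traname' have no missing values in the loaded data, as the `df.info()` output shows.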
