# Homework

In [1]:
import graphlab

In [31]:
song_data = graphlab.SFrame('song_data.gl/')
graphlab.canvas.set_target('ipynb')

**1. Counting unique users:** The method `.unique()` can be used to select the unique elements in a column of data. In this question, you will compute the number of unique users who have listened to songs by various artists. For example, to find the number of unique users who listened to songs by *Kanye West*, select the rows of the song data where the artist is *Kanye West*, and then count the number of unique entries in the `user_id` column. Compute the number of unique users for each of these artists: *Kanye West*, *Foo Fighters*, *Taylor Swift* and *Lady GaGa*.
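
The filter-then-count-unique idea can also be sketched in plain Python, without GraphLab. This is only an illustration: `toy_rows` and `count_unique_users` are hypothetical names, and the rows are made up rather than taken from `song_data`.

```python
# Toy stand-in for song_data: (user_id, artist) pairs. These rows are made up.
toy_rows = [
    ('u1', 'Kanye West'),
    ('u1', 'Kanye West'),   # same user, a second song by the same artist
    ('u2', 'Kanye West'),
    ('u2', 'Foo Fighters'),
    ('u3', 'Taylor Swift'),
]


def count_unique_users(rows, artist):
    """Count distinct user ids among the rows whose artist matches."""
    # Building a set drops repeated user ids, which is the role .unique() plays.
    return len(set(user for user, row_artist in rows if row_artist == artist))


kanye_users = count_unique_users(toy_rows, 'Kanye West')
print(kanye_users)
```

User `u1` appears twice for *Kanye West* but is counted once, which is why counting rows alone would overstate the number of listeners.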

In [32]:
song_data.head()

user_id,song_id,listen_count,title,artist,song
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOAKIMP12A8C130995,1,The Cove,Jack Johnson,The Cove - Jack Johnson
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Paco De Lucia,Entre Dos Aguas - Paco De Lucia ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBXHDL12A81C204C0,1,Stronger,Kanye West,Stronger - Kanye West
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBYHAJ12A6701BF1D,1,Constellations,Jack Johnson,Constellations - Jack Johnson ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODACBL12A8C13C273,1,Learn To Fly,Foo Fighters,Learn To Fly - Foo Fighters ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODDNQT12A6D4F5F7E,5,Apuesta Por El Rock 'N' Roll ...,Héroes del Silencio,Apuesta Por El Rock 'N' Roll - Héroes del ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODXRTY12AB0180F3B,1,Paper Gangsta,Lady GaGa,Paper Gangsta - Lady GaGa
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFGUAY12AB017B0A8,1,Stacked Actors,Foo Fighters,Stacked Actors - Foo Fighters ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFRQTD12A81C233C0,1,Sehr kosmisch,Harmonia,Sehr kosmisch - Harmonia
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOHQWYZ12A6D4FA701,1,Heaven's gonna burn your eyes ...,Thievery Corporation feat. Emiliana Torrini ...,Heaven's gonna burn your eyes - Thievery ...


In [23]:
artists = ['Kanye West', 'Foo Fighters', 'Taylor Swift', 'Lady GaGa']

for artist in artists:
    users_per_artist = song_data[song_data['artist'] == artist]['user_id']
    
    print "Unique users count for %s: %d" % (artist, len(users_per_artist.unique()))

Unique users count for Kanye West: 2522
Unique users count for Foo Fighters: 2055
Unique users count for Taylor Swift: 3246
Unique users count for Lady GaGa: 2928


**2. Using groupby-aggregate to find the most popular and least popular artist:** Each row of `song_data` contains the number of times a user listened to a particular song by a particular artist. If we would like to know how many times any song by *Kanye West* was listened to, we need to select all the rows where `'artist' == 'Kanye West'` and sum the `listen_count` column. If we would like to find the most popular artist, we would need to follow this procedure for each artist, which would be very slow. Instead, you will learn about a very important method:

```python
.groupby()
```

You can read the documentation about `groupby` [here](https://turi.com/products/create/docs/generated/graphlab.SFrame.groupby.html#graphlab.SFrame.groupby). The `.groupby` method computes an aggregate (in our case, the sum of the `listen_count`) for each distinct value in a column (in our case, the `artist` column).

Follow these steps to find the most popular artist in the dataset:

The `.groupby` method has two important parameters:
1. `key_columns`, which takes the column we want to group by; in our case, `artist`.
2. `operations`, where we define the aggregation operation we are using; in our case, we want to sum over the `listen_count`.

With this in mind, the following command will compute the sum of `listen_count` for each artist and return an SFrame with the results:

```python
song_data.groupby(
    key_columns='artist', 
    operations={'total_count': graphlab.aggregate.SUM('listen_count')})
```

The total number of listens for each artist will be stored in `total_count`.

Sort the resulting SFrame by `total_count`, and find the most popular and least popular artists in the dataset.
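
To make the group-sum-sort idea concrete, here is a minimal plain-Python sketch of what a groupby with a `SUM` aggregator followed by a sort computes. It does not use GraphLab; `toy_rows`, `total_listens_per_artist` and the artist names are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Made-up (artist, listen_count) rows standing in for song_data.
toy_rows = [
    ('Artist A', 3),
    ('Artist B', 1),
    ('Artist A', 2),
    ('Artist C', 7),
    ('Artist B', 1),
]


def total_listens_per_artist(rows):
    """Sum listen_count for each distinct artist in a single pass."""
    totals = defaultdict(int)
    for artist, listen_count in rows:
        totals[artist] += listen_count
    return dict(totals)


totals = total_listens_per_artist(toy_rows)

# Sort by the aggregated total, largest first, like sorting on total_count.
ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
most_popular = ranked[0]
least_popular = ranked[-1]
print(most_popular)
print(least_popular)
```

A single pass over the rows yields every artist's total at once, which is the reason grouping is preferable to filtering the data separately for each artist.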


In [46]:
total_count_per_artist = song_data.groupby(
    key_columns='artist', 
    operations={'total_count': graphlab.aggregate.SUM('listen_count')})

comparing_artists = ['William Tabbert', 'Velvet Underground & Nico', 'Kanye West', 'The Cool Kids']
least_popular = ('', float('inf'))  # sentinel larger than any real listen count

for artist in comparing_artists:
    row = total_count_per_artist[total_count_per_artist['artist'] == artist]
    count = row['total_count'][0]
    
    if count < least_popular[1]:
        least_popular = (artist, count)
        

print 'The least popular artist is %s with %d' % least_popular 

The least popular artist is William Tabbert with 14


In [47]:
total_count_per_artist.sort('total_count', ascending=False)

artist,total_count
Kings Of Leon,43218
Dwight Yoakam,40619
Björk,38889
Coldplay,35362
Florence + The Machine,33387
Justin Bieber,29715
Alliance Ethnik,26689
OneRepublic,25754
Train,25402
The Black Keys,22184


**3. Using groupby-aggregate to find the most recommended songs:** Now that we learned how to use `.groupby()` to compute aggregates for each value in a column, let’s use it to find the song that is most recommended by the `personalized_model` model we trained in the iPython notebook above. Follow these steps to find the most recommended song:

* Split the data into 80% training, 20% testing, using seed=0, as was done in the iPython notebook above.

In [26]:
train_data, test_data = song_data.random_split(.8, seed=0)
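
As a rough illustration of why a fixed seed makes the split repeatable, here is a plain-Python sketch of a seeded random split. It is an assumption-laden stand-in: `split_rows` is a hypothetical helper, and GraphLab's own sampling procedure and random number generator are not reproduced here, so the actual rows chosen by `random_split` will differ.

```python
import random


def split_rows(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # a private generator, so the seed fully decides the split
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1000))  # made-up row ids
train, test = split_rows(rows, 0.8, seed=0)
train_again, test_again = split_rows(rows, 0.8, seed=0)

# Same seed, same split; the sizes are only approximately 80% / 20%.
print(train == train_again)
print(len(train) + len(test))
```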

* Train an `item_similarity_recommender`, as done in the iPython notebook, using the training data.

In [27]:
personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                user_id='user_id',
                                                                item_id='song')
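
For intuition about what an item-similarity recommender does, the following plain-Python sketch scores unseen songs by their similarity to the songs a user already listened to. It assumes Jaccard similarity between the sets of users who listened to each song, and it averages those similarities to get a score; both choices are assumptions made for illustration, and the scores GraphLab reports may be computed differently. All names and data below are hypothetical.

```python
from collections import defaultdict

# Made-up (user, song) listening pairs.
toy_listens = [
    ('u1', 'song_a'), ('u1', 'song_b'),
    ('u2', 'song_a'), ('u2', 'song_b'), ('u2', 'song_c'),
    ('u3', 'song_b'), ('u3', 'song_c'),
    ('u4', 'song_d'),
]


def build_item_users(listens):
    """Map each song to the set of users who listened to it."""
    item_users = defaultdict(set)
    for user, item in listens:
        item_users[item].add(user)
    return item_users


def jaccard(users_a, users_b):
    """Jaccard similarity: shared listeners divided by all listeners."""
    union = users_a | users_b
    if not union:
        return 0.0
    return len(users_a & users_b) / float(len(union))


def recommend_one(listens, user):
    """Return (song, score) for the best unseen song, like asking for k=1."""
    item_users = build_item_users(listens)
    seen = set(item for u, item in listens if u == user)
    best_item, best_score = None, -1.0
    for candidate in sorted(item_users):  # sorted for a deterministic tie-break
        if candidate in seen:
            continue
        sims = [jaccard(item_users[candidate], item_users[s]) for s in seen]
        score = sum(sims) / len(sims) if sims else 0.0
        if score > best_score:
            best_item, best_score = candidate, score
    return best_item, best_score


best_song, best_score = recommend_one(toy_listens, 'u1')
print(best_song)
```

In this toy data, `song_c` shares listeners with both songs that `u1` already knows, while `song_d` shares none, so `song_c` is the one recommended.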

* Next, we are going to make recommendations for the users in the test data, but there are over 200,000 rows (58,628 unique users) in the test set. Computing recommendations for this many users can be slow on some computers. Thus, we will use only the first 10,000 users in this question. Use this command to select this subset of users:

In [28]:
subset_test_users = test_data['user_id'].unique()[0:10000]

* Let’s compute one recommended song for each of these test users. Use this command to compute these recommendations:

In [34]:
recommended_songs = personalized_model.recommend(subset_test_users, k=1)
recommended_songs

user_id,song,score,rank
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Grind With Me (Explicit Version) - Pretty Ricky ...,0.0459424376488,1
696787172dd3f5169dc94deef 97e427cee86147d ...,Senza Una Donna (Without A Woman) - Zucchero / ...,0.017026577677,1
532e98155cbfd1e1a474a28ed 96e59e50f7c5baf ...,Jive Talkin' (Album Version) - Bee Gees ...,0.0118288653237,1
18325842a941bc58449ee71d6 59a08d1c1bd2383 ...,Goodnight And Goodbye - Jonas Brothers ...,0.0159257985651,1
507433946f534f5d25ad1be30 2edb9a2376f503c ...,Find The Cost Of Freedom - Crosby_ Stills_ Nash & ...,0.0165806589303,1
18fafad477f9d72ff86f7d0bd 838a6573de0f64a ...,Rabbit Heart (Raise It Up) - Florence + The ...,0.0799399726093,1
fe85b96ba1983219b296f6b48 69dd29eb2b72ff9 ...,Secrets - OneRepublic,0.0788827141126,1
225ea420b4bede50919d1bfe2 4a599691522d176 ...,Clocks - Coldplay,0.0271030251796,1
95dc7e2b188b1148b2d25f4e6 b6e94afacc4efc3 ...,Bust a Move - Infected Mushroom ...,0.0534738540649,1
4a3a1ae2748f12f7ab921a47d 6d79abf82e3e325 ...,Isis (Spam Remix) - Alaska Y Dinarama ...,0.04180302118,1


* Finally, we can use `.groupby()` to find the most recommended song! :) When we used `.groupby()` in the previous question, we summed up the total `listen_count` for each artist by using the `SUM` aggregator:

```python
operations={'total_count': graphlab.aggregate.SUM('listen_count')}
```

For this question, we simply want to count how often each song is recommended, so we will use the `COUNT` aggregator instead of `SUM`, and store the results in a column we will call `count` by using:

```python
operations={'count': graphlab.aggregate.COUNT()}
```

And, since we want to use the song titles as the key to the aggregator instead of the `artist`, we use:

```python
key_columns='song'
```

* By sorting the results, you will find out the most recommended song to the first 10,000 users in the test data! **Due to randomness in train-test split, the most recommended song may come out differently for different people. This is why we chose not to assign a quiz question for this section.**

In [36]:
recommended_songs.groupby(
    key_columns='song', 
    operations={'count': graphlab.aggregate.COUNT()}).sort('count', ascending=False)

song,count
Undo - Björk,415
Secrets - OneRepublic,381
Revelry - Kings Of Leon,229
You're The One - Dwight Yoakam ...,164
Fireflies - Charttraxx Karaoke ...,110
Sehr kosmisch - Harmonia,107
Hey_ Soul Sister - Train,106
Horn Concerto No. 4 in E flat K495: II. Romance ...,90
OMG - Usher featuring will.i.am ...,60
Bigger - Justin Bieber,43
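
The count-and-sort step above can also be sketched in plain Python with `collections.Counter`, which plays the role of a groupby with a `COUNT` aggregator followed by a descending sort. The recommendation pairs below are made up, and `toy_recommendations` is a hypothetical name.

```python
from collections import Counter

# Made-up (user, recommended_song) pairs, one recommendation per user (k=1).
toy_recommendations = [
    ('u1', 'song_a'),
    ('u2', 'song_b'),
    ('u3', 'song_a'),
    ('u4', 'song_c'),
    ('u5', 'song_a'),
    ('u6', 'song_b'),
]

# Count how often each song appears, ignoring which user received it.
recommendation_counts = Counter(song for _user, song in toy_recommendations)

# most_common() returns (song, count) pairs sorted from most to least frequent.
ranked = recommendation_counts.most_common()
most_recommended = ranked[0]
print(most_recommended)
```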
