# Similar Users Lab

BUT FIRST a quick word about strings, lists, and sets:

## Working with sets

In mathematics, a set is a collection of distinct objects.  In Python, a set is an unordered collection with no duplicate entries (think of a list that drops duplicates and does not keep its order). Set objects also support mathematical operations like union, intersection, difference, and symmetric difference.

_Fun fact for your next party:  Technically, CPython sets are implemented as hash tables, much like dictionaries (under the hood)._

Here are two sets of colors:


In [1]:
a = set(["Red", "Green", "Blue"])
b = set(["Black", "White", "Green"])

To find out which items are in both sets (**both sets only**), use the "intersection" method:

In [2]:
a.intersection(b)

{'Green'}

To find the items that are in a but not in b, use the "difference" method:

In [3]:
a.difference(b)

{'Blue', 'Red'}

To find the items that are in b but not in a, swap the two sets:

In [4]:
b.difference(a)

{'Black', 'White'}

To find all unique items across both sets (aka: union):

In [5]:
a.union(b)

{'Black', 'Blue', 'Green', 'Red', 'White'}

How many items are in b but not in a?

In [6]:
print("Number of different items in b:  %d" % len(b.difference(a)))

Number of different items in b:  2
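
Symmetric difference was mentioned above but not shown. A minimal sketch, reusing the same two color sets; the operator forms (`&`, `|`, `-`, `^`) are standard Python shorthand for the methods used in this section:

```python
a = set(["Red", "Green", "Blue"])
b = set(["Black", "White", "Green"])

# Items that appear in exactly one of the two sets
only_one = a.symmetric_difference(b)

# Operator shorthand for the same set operations
both = a & b          # intersection
either = a | b        # union
a_only = a - b        # difference
only_one_op = a ^ b   # symmetric difference

print(sorted(only_one))  # ['Black', 'Blue', 'Red', 'White']
```

The methods also accept any iterable (a list, for example), while the operators require both sides to be sets.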


## From Sets to Lists

Now that we're experts at working with Python sets, let's get savvy working with lists and unstructured data.

Using the split() method on a string, we can "split" it by a delimiter to produce a list.  The .split() method is available on any string object and, by default, splits on whitespace.  

*You can pass a parameter to split to change which character it will split on, such as ",", if you're trying to turn a comma-separated string of items into a list.*
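
As a small sketch of splitting on a different delimiter, here is a made-up comma-separated string. Stripping each piece is an extra precaution (not part of the lab) in case the spacing around the commas is inconsistent:

```python
raw = "Blues,Reggae , Electronic Music"

pieces = raw.split(",")                         # split on commas only
cleaned = [piece.strip() for piece in pieces]   # trim stray spaces

print(pieces)   # ['Blues', 'Reggae ', ' Electronic Music']
print(cleaned)  # ['Blues', 'Reggae', 'Electronic Music']
```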

The following will turn a space delimited *string* into a **list**.

In [7]:
"my name is dave my name is dave my name is dave".split()

['my',
 'name',
 'is',
 'dave',
 'my',
 'name',
 'is',
 'dave',
 'my',
 'name',
 'is',
 'dave']

What's up with this though?  We all know "my name is dave", but if we had many values, it would be hard to know which of them are unique.  That's when we use sets.

In [8]:
set("my name is dave my name is dave my name is dave".split())

{'dave', 'is', 'my', 'name'}

OK, so we should know enough to conquer our Jaccard distance problem and step into our real problem:

## Who has similar tastes in music?

What we will attempt is building a small process that takes feedback from a survey and applies a Jaccard-based function to find similar users.

Along the way we will be:
* Working with requests
* Understanding Python fundamentals with sets and lists
* Cleaning up bad data
* Implementing Jaccard distance function
* Finding similar users

First, we will be taking a survey!  Let's all visit the survey posted in the channel before continuing.

*[Check out #General]*

Hopefully everything goes smoothly.  It's possible that I may need to modify the permissions on the sheet or provide a CSV snapshot if we hit a snag.

We will be loading our results via HTTP, then wrapping the downloaded bytes in BytesIO, which lets us treat them as if they were a file resource, and finally loading them into Pandas as a DataFrame.  This is set up for us now.

In [9]:
import pandas as pd
import requests

from io import BytesIO  # in-memory file object; available on Python 2 and 3

%matplotlib inline

spreadsheet = "https://docs.google.com/spreadsheets/d/1cpUb7XbN-qOq4xbGdYfhY9FtrMqRd0izz4PmTPMejt0/export?format=csv&id=1cpUb7XbN-qOq4xbGdYfhY9FtrMqRd0izz4PmTPMejt0&gid=216538035"
http = requests.get(spreadsheet)
http.raise_for_status()  # fail early if the sheet is not readable
csv_data = BytesIO(http.content)  # http.content is the raw response bytes
df = pd.read_csv(csv_data, index_col=0)
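
The in-memory file idea can be tried without the network. A minimal sketch using only the standard library (`io.StringIO` and `csv`) and a made-up two-row snapshot; the names and answers are hypothetical, not survey data:

```python
import csv
from io import StringIO

# Hypothetical two-row snapshot; not real survey responses
snapshot = 'Name,genres\nAda,"Blues, Jazz"\nLin,Metal\n'

buffer = StringIO(snapshot)          # behaves like an open text file
rows = list(csv.DictReader(buffer))  # any file-reading API can consume it

for row in rows:
    print(row["Name"], row["genres"].split(", "))
```

`pd.read_csv` accepts the same kind of buffer, which is one way a CSV snapshot could be loaded if the live sheet is unavailable.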

In [10]:
df.head(10)

| Timestamp | Name | Favorite Genres / Genres you like | What time of day do you like to listen to music? |
| --- | --- | --- | --- |
| 6/29/2016 9:53:54 | Kathleen | Electronic Music | |
| 6/29/2016 9:54:36 | Stav | Alternative Music, Country, Dance, Electronic ... | Morning, Noon, Afternoon, Night, Special occas... |
| 6/29/2016 9:54:39 | Courtney | Blues, Dance, Hip Hop / Rap, Pop, Rock, Singer... | Afternoon, Night |
| 6/29/2016 9:54:46 | Kathleen | Country, Dance, Electronic Music, Hip Hop / Ra... | Morning, Noon, Afternoon, Night, Special occas... |
| 6/29/2016 9:54:53 | Michael Sanders | Alternative Music, Blues, Dance, Electronic Mu... | 24/7 |
| 6/29/2016 9:55:07 | Ed | Alternative Music, Dance, Hip Hop / Rap, Pop | Morning, Night |
| 6/29/2016 9:55:25 | Alec | Metal | 24/7 |
| 6/29/2016 9:55:25 | Justin | Alternative Music, Hip Hop / Rap, Indie Pop, R... | 24/7 |
| 6/29/2016 9:55:52 | Larry Lizard | Ultra Speed Metal | 24/7 |
| 6/29/2016 9:55:56 | schulzey | Alternative Music, Dance, Easy Listening, Hip ... | Afternoon |


**1. Rename the genre feature**

We get bad data from spreadsheets all the time.  In this case, it's coming from a survey.  For ease of reference, rename the feature **"Favorite Genres / Genres you like"** to **"genres"**.


In [11]:
# Rename the genre feature, and the time-of-day feature for later as well

columns = { 
    "Favorite Genres / Genres you like": "genres", 
    "What time of day do you like to listen to music?": "times"
}

df.rename(columns=columns, inplace=True)

**2. Select only your response from the new "genres" feature**

Try printing out only the first value, where df["Name"] == "[Your name]".

In [12]:
df[df['Name'] == "Sam"]

| Timestamp | Name | genres | times |
| --- | --- | --- | --- |
| 6/29/2016 9:56:39 | Sam | Blues, Classical, Dance, Electronic Music, Eur... | 24/7 |


**3. Take your survey response for "genres" and split it into a list with one item per genre you chose**

For example, if you chose "Blues, Reggae, Electronic Music", convert it to a list that looks like ["Blues", "Reggae", "Electronic Music"].

In [13]:
# You can use .values or .iloc
# df[df['Name'] == "Sam"]['genres'].iloc[0]
# The survey separates choices with a comma followed by a space
df[df['Name'] == "Sam"]['genres'].values[0].split(", ")

['Blues', 'Classical', 'Dance', 'Electronic Music', 'European Music (Folk / Pop)', 'Indie Pop', 'Jazz', 'Opera', 'Pop', 'Rock', 'World Music / Beats', 'Metal', 'Ultra Speed Metal']
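
Survey cells are not always tidy, so the split step can be wrapped in a small cleaning helper. A minimal sketch, assuming missing answers arrive as NaN (a float) and that no genre name itself contains a comma; `parse_genres` is a hypothetical name, not part of the lab:

```python
def parse_genres(value):
    """Return a list of genre names from one survey cell (hypothetical helper)."""
    if not isinstance(value, str):
        # Missing answers show up as NaN (a float), not a string
        return []
    # Split on commas, trim spaces, and drop empty leftovers
    return [item.strip() for item in value.split(",") if item.strip()]


print(parse_genres("Blues, Reggae,Electronic Music , "))
print(parse_genres(float("nan")))
```

The first call prints the three genre names without stray spaces, and the second prints an empty list.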

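**4. Create a function that takes 2 lists, then calculate Jaccard distance**

This is 0-60 mph, I know, but you can do this!  Double check our slides, and refer to the set operations above for how to calculate this.  

Here is some boilerplate to get you going.  Note that it returns the size of the intersection divided by the size of the union, which is strictly the Jaccard *similarity*: 1.0 means identical sets, and the Jaccard distance is 1 minus this value.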

In [14]:
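def jaccard(list1, list2):
    """Intersection over union of two lists (1.0 = identical, 0.0 = no overlap)."""
    a = set(list1)
    b = set(list2)

    # float() keeps the division fractional on Python 2 as well
    numerator = float(len(a.intersection(b)))
    denominator = float(len(a.union(b)))

    if denominator == 0:
        # Both lists are empty: avoid dividing by zero
        return 0.0

    return numerator / denominator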

list1 = ['blue', 'green', 'yellow']
list2 = ['black', 'orange', 'yellow', 'green']

jaccard(list1, list2)

0.4
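
For comparison, here is a minimal sketch that computes both the similarity and the distance form on the same two lists. The function names are hypothetical, and returning 0.0 similarity for two empty inputs is an assumed convention:

```python
def jaccard_similarity(items1, items2):
    """Size of the intersection divided by size of the union."""
    a, b = set(items1), set(items2)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty inputs share nothing measurable
    return len(a & b) / float(len(union))


def jaccard_distance(items1, items2):
    """One minus the similarity: 0.0 = identical, 1.0 = no overlap."""
    return 1.0 - jaccard_similarity(items1, items2)


list1 = ['blue', 'green', 'yellow']
list2 = ['black', 'orange', 'yellow', 'green']

print(jaccard_similarity(list1, list2))  # 0.4
print(jaccard_distance(list1, list2))    # 0.6
```

With the similarity form, the most similar users sort to the top in descending order; with the distance form, they sort to the top in ascending order.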

**5.  Now for our final trick, calculate the distance between your genre preferences vs everyone else.**

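Loop through everyone in the dataframe, create a list out of their "genres" string, echo out their name, then finally the score between your set and theirs.  Because the function from step 4 returns higher values for more similar sets, sort the results in descending order.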

In [15]:
def apply_jaccard(row):

    if isinstance(row['genres'], str):  # skip missing answers (NaN)
        user_genres = row['genres'].split(", ")
    else:
        user_genres = []
        
    row['jaccard_distance'] = jaccard(my_genres, user_genres)
    
    return row
    
my_genres =  df[df['Name'] == "Sam"]['genres'].values[0].split(", ")
my_recs   =  df.apply(apply_jaccard, axis=1)


In [16]:
print("Similar users to 'Sam'")
my_recs[['Name', 'jaccard_distance']].sort_values('jaccard_distance', ascending=False)


Similar users to 'Sam'




| Timestamp | Name | jaccard_distance |
| --- | --- | --- |
| 6/29/2016 9:56:39 | Sam | 1.0 |
| 6/29/2016 9:55:57 | Nori | 0.473684 |
| 6/29/2016 9:57:51 | Dave | 0.45 |
| 6/29/2016 9:54:53 | Michael Sanders | 0.411765 |
| 6/29/2016 9:54:39 | Courtney | 0.266667 |
| 6/29/2016 9:54:46 | Kathleen | 0.25 |
| 6/29/2016 9:56:20 | Jared | 0.25 |
| 6/29/2016 9:56:54 | Nathan | 0.222222 |
| 6/29/2016 9:54:36 | Stav | 0.2 |
| 6/29/2016 9:55:56 | schulzey | 0.166667 |
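
The same ranking can be sketched as a plain loop, without Pandas. The responses below are hypothetical (not taken from the survey), and the `jaccard` helper is repeated so the block runs on its own:

```python
def jaccard(list1, list2):
    """Intersection over union; 0.0 when both lists are empty."""
    a, b = set(list1), set(list2)
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0


# Hypothetical responses, not taken from the survey
responses = {
    "me": "Blues, Jazz, Rock",
    "user_a": "Blues, Jazz, Pop",
    "user_b": "Metal",
    "user_c": None,  # skipped the question
}

my_genres = responses["me"].split(", ")

scores = []
for name, answer in responses.items():
    genres = answer.split(", ") if isinstance(answer, str) else []
    scores.append((name, jaccard(my_genres, genres)))

# Highest score first: 1.0 means identical sets
scores.sort(key=lambda pair: pair[1], reverse=True)
for name, score in scores:
    print(name, score)
```

As in the table above, you always match yourself with a score of 1.0, so the first interesting row is the second one.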


**Optional 6. Try calculating the distance on the time of day feature.**

Try to make a new dataframe, for just you vs everyone, using Jaccard and time of day.  Are there any interesting patterns you see?

In [17]:
def apply_jaccard_tod(row):

    if isinstance(row['times'], str):  # skip missing answers (NaN)
        user_times = row['times'].split(", ")
    else:
        user_times = []
        
    row['jaccard_distance'] = jaccard(my_times, user_times)
    
    return row

my_times =  df[df['Name'] == "Sam"]['times'].values[0].split(", ")
my_recs  =  df.apply(apply_jaccard_tod, axis=1)

my_recs[['Name', 'jaccard_distance']].sort_values('jaccard_distance', ascending=False)




| Timestamp | Name | jaccard_distance |
| --- | --- | --- |
| 6/29/2016 9:57:07 | Gary Fastpace | 1.0 |
| 6/29/2016 9:54:53 | Michael Sanders | 1.0 |
| 6/29/2016 9:58:02 | Jerry kuai | 1.0 |
| 6/29/2016 9:55:25 | Alec | 1.0 |
| 6/29/2016 9:55:25 | Justin | 1.0 |
| 6/29/2016 9:55:52 | Larry Lizard | 1.0 |
| 6/29/2016 10:01:49 | Jerold Masrapido | 1.0 |
| 6/29/2016 9:55:57 | Nori | 1.0 |
| 6/29/2016 9:56:39 | Sam | 1.0 |
| 6/29/2016 10:05:42 | Gary Garygargrrrr | 1.0 |


**Optional 7. What can you say about the selection of options for genre or time and what they mean?**

One thing that is pretty obvious is that there are fewer options for times of day.  The time-of-day options are also much broader, so they may not be a great predictor of personal preferences within the dataset.

Also, options that broadly generalize other options already in the set dilute the value of the preference you're collecting.  For instance, options such as "24/7", "all", or "everything" could describe other options in the same set and don't point to a preference for anything specific.  If you're going to ask explicitly for feedback, these options will not be very useful.
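
To see the catch-all problem in numbers, here is a minimal sketch. It assumes only the four time options that are fully visible in the truncated output earlier (the real survey has at least one more), and `expand_times` is a hypothetical helper:

```python
def jaccard(list1, list2):
    """Intersection over union; 0.0 when both lists are empty."""
    a, b = set(list1), set(list2)
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0


# Assumption: only the four options visible in the truncated output;
# the real survey offers at least one more option.
ALL_TIMES = ["Morning", "Noon", "Afternoon", "Night"]


def expand_times(times):
    """Replace the catch-all "24/7" answer with every specific option."""
    return list(ALL_TIMES) if "24/7" in times else list(times)


always = ["24/7"]
all_day = ["Morning", "Noon", "Afternoon", "Night"]

print(jaccard(always, all_day))                              # 0.0
print(jaccard(expand_times(always), expand_times(all_day)))  # 1.0
```

Two people who effectively gave the same answer score 0.0 until the catch-all option is expanded, which is one reason such options make poor inputs to a set-based measure.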