# Similar Users Lab

BUT FIRST a quick word about strings, lists, and sets:

## Working with sets

In mathematics, a set is a collection of distinct objects.  In Python, a set is an unordered collection with no duplicate entries. Set objects also support mathematical operations like union, intersection, difference, and symmetric difference.

_Fun fact for your next party:  Technically, CPython sets are built on hash tables, much like dictionaries (under the hood)._

Here are two sets of colors:


In [1]:
a = set(["Red", "Green", "Blue"])
b = set(["Black", "White", "Green"])

To find out which items are in both sets (**both sets only**), use the "intersection" method:

In [2]:
a.intersection(b)

{'Green'}

To find the items in a but not in b, use the "difference" method:

In [3]:
a.difference(b)

{'Blue', 'Red'}

To find the items in b but not in a:

In [4]:
b.difference(a)

{'Black', 'White'}

To find all unique items across both sets (aka: union):

In [5]:
a.union(b)

{'Black', 'Blue', 'Green', 'Red', 'White'}
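
The opening paragraph also mentions symmetric difference, which the cells above don't demonstrate. Here is a minimal sketch using the same two color sets. The operator forms (`&`, `|`, `-`, `^`) are standard Python shorthand for the methods; the variable name `only_one` is just for illustration.

```python
a = set(["Red", "Green", "Blue"])
b = set(["Black", "White", "Green"])

# Items that appear in exactly one of the two sets (everything but the overlap)
only_one = a.symmetric_difference(b)

# Operator shorthand gives the same results as the named methods
assert a & b == a.intersection(b)
assert a | b == a.union(b)
assert a - b == a.difference(b)
assert a ^ b == only_one

print(sorted(only_one))  # ['Black', 'Blue', 'Red', 'White']
```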

How many are different?

In [6]:
print("Number of different items in b:  %d" % len(b.difference(a)))

Number of different items in b:  2


## From Sets to Lists

Now that we're experts at working with Python sets, let's get savvy working with lists and unstructured data.

Using the split() method on a string, we can "split" it on a delimiter and get back a list.  The .split() method is available on any string object and, by default, splits on whitespace.  

*You can pass a parameter to split to change which character it splits on, such as ",", if you're trying to turn a comma-separated string of items into a list.*
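
As a quick illustration of the delimiter parameter, here is a small sketch. The sample strings are made up for the example.

```python
# Passing "," splits on commas instead of whitespace
colors = "Red,Green,Blue".split(",")
print(colors)  # ['Red', 'Green', 'Blue']

# With no argument, split() breaks on any run of whitespace (spaces, tabs, newlines)
words = "my  name\tis dave".split()
print(words)  # ['my', 'name', 'is', 'dave']
```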

The following will turn a space delimited *string* into a **list**.

In [7]:
"my name is dave my name is dave my name is dave".split()

['my',
 'name',
 'is',
 'dave',
 'my',
 'name',
 'is',
 'dave',
 'my',
 'name',
 'is',
 'dave']

What's up with this though?  We all know "my name is dave", but if we had many values, it would be hard to know which of them are unique.  That's when we use sets.

In [8]:
set("my name is dave my name is dave my name is dave".split())

{'dave', 'is', 'my', 'name'}

Ok, so we should know enough to conquer our Jaccard distance problem and step into our real problem:

## Who has similar tastes in music?

What we will attempt is to build a small process that takes feedback from a survey and applies a distance function to find similar users based on Jaccard.

Along the way we will be:
* Working with requests
* Understanding Python fundamentals with sets and lists
* Cleaning up bad data
* Implementing a Jaccard distance function
* Finding similar users

First, we will be taking a survey!  Let's all visit the survey posted in the channel before continuing.

*[Check out #General]*

Hopefully everything goes smoothly.  It's possible that I may need to modify the permissions on the sheet or provide a CSV snapshot if we hit a snag.

We will be fetching our results via HTTP, wrapping them in StringIO, which lets us operate on strings as if they were file resources, and then loading them into Pandas as a DataFrame.  This is set up for us now.

In [17]:
import pandas as pd
import requests

from io import StringIO  # file-like wrapper around an in-memory string

%matplotlib inline

spreadsheet = "https://docs.google.com/spreadsheets/d/1cpUb7XbN-qOq4xbGdYfhY9FtrMqRd0izz4PmTPMejt0/export?format=csv&id=1cpUb7XbN-qOq4xbGdYfhY9FtrMqRd0izz4PmTPMejt0&gid=216538035"
http = requests.get(spreadsheet)
# http.text is the decoded response body; StringIO lets read_csv treat it like a file
csv_data = StringIO(http.text)
df = pd.read_csv(csv_data, index_col=0)
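
To see the "string as a file" idea without touching the network, here is a minimal sketch that uses only the standard library. The CSV text is a hand-typed stand-in for the survey export (two rows borrowed from the survey), and the names `raw`, `buffer`, and `rows` are illustrative. `pd.read_csv` accepts the same kind of file-like object, which is what the cell above relies on.

```python
import csv
from io import StringIO

# Hand-typed stand-in for the downloaded CSV text (an assumption for this sketch)
raw = '''Name,genres
Yomi,"Jazz, Pop, R&B / Soul"
Colby,"Blues, Electronic Music, Indie Pop, Reggae"
'''

# StringIO wraps the string in an object that can be read like an open file
buffer = StringIO(raw)

# Any reader that expects a file, such as csv.DictReader, can now consume it
rows = list(csv.DictReader(buffer))
print(rows[0]["Name"], "->", rows[0]["genres"])
```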

In [20]:
df.head(20)

| Timestamp | Name | Favorite Genres / Genres you like | What time of day do you like to listen to music? |
|---|---|---|---|
| 2/8/2016 2:09:12 | Dave | Blues, Classical, Electronic Music, Hip Hop / ... | 24/7 |
| 2/8/2016 20:56:27 | Brian Zhou | Alternative Music, Dance, Electronic Music, Hi... | Night |
| 2/8/2016 20:56:44 | Colby | Blues, Electronic Music, Indie Pop, Reggae | Morning, Night, Special occasions |
| 2/8/2016 20:56:46 | Porpoises | Alternative Music, Dance, Easy Listening, Elec... | 24/7 |
| 2/8/2016 20:56:50 | Tam | Classical, Dance, Easy Listening, Jazz, Pop, R... | 24/7 |
| 2/8/2016 20:57:12 | Scully | Alternative Music, Blues, Classical, European ... | Morning, Afternoon, Night, Special occasions |
| 2/8/2016 20:57:16 | Mike Levine | Classical, Hip Hop / Rap, Jazz, Pop, R&B / Sou... | 24/7 |
| 2/8/2016 20:57:17 | Yomi | Jazz, Pop, R&B / Soul | Afternoon, Night |
| 2/8/2016 20:57:20 | Esther | Alternative Music, Indie Pop, Singer / Songwri... | Afternoon, Night, workday |
| 2/8/2016 20:57:26 | Eric | Alternative Music, Dance, Hip Hop / Rap, Asian... | Afternoon |


**1. Rename the genre feature**

We get bad data from spreadsheets all the time.  In this case, it's coming from a survey.  For ease of reference, rename the feature **"Favorite Genres / Genres you like"** to **"genres"**.


In [22]:
df.rename(columns={"Favorite Genres / Genres you like":"genres"},inplace=True)

**2. Select only your response from the new "genres" feature**

Try printing out only the first value, where df["Name"] == "[Your name]".

In [24]:
df['Name'] == "Yomi"

Timestamp
2/8/2016 2:09:12     False
2/8/2016 20:56:27    False
2/8/2016 20:56:44    False
2/8/2016 20:56:46    False
2/8/2016 20:56:50    False
2/8/2016 20:57:12    False
2/8/2016 20:57:16    False
2/8/2016 20:57:17     True
2/8/2016 20:57:20    False
2/8/2016 20:57:26    False
2/8/2016 20:57:26    False
2/8/2016 20:57:28    False
2/8/2016 20:57:40    False
2/8/2016 20:57:41    False
Name: Name, dtype: bool
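
The comparison above only produces a True/False mask. One way to turn that mask into the actual response is sketched below on a tiny hand-built frame; `toy` and its two rows are stand-ins for the survey data, not the real `df`.

```python
import pandas as pd

# Tiny stand-in frame with two of the survey rows shown earlier
toy = pd.DataFrame({
    "Name": ["Colby", "Yomi"],
    "genres": ["Blues, Electronic Music, Indie Pop, Reggae",
               "Jazz, Pop, R&B / Soul"],
})

mask = toy["Name"] == "Yomi"         # boolean Series, one entry per row
my_genres = toy.loc[mask, "genres"]  # keep only the rows where the mask is True
print(my_genres.iloc[0])             # first (here, only) matching value
```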

**3. Take your survey response for "genres", and split it into a list with one item per genre you chose**

For example if you chose "Blues, Reggae, Electronic Music", convert it to a list that looks like ["Blues", "Reggae", "Electronic Music"].

In [33]:
# Select my row's genres, then split the first (only) match on commas
listsongs = df.genres[df['Name'] == "Yomi"]
listsongs = listsongs.iloc[0].split(",")
listsongs

['Jazz', ' Pop', ' R&B / Soul']
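
Notice the leading spaces in `' Pop'` and `' R&B / Soul'`. Because set comparisons are exact, `' Pop'` and `'Pop'` count as different items. A possible cleanup, sketched here on the same response string, is to strip each piece after splitting. The scores printed later in this lab come from the unstripped lists, so applying this cleanup would likely change some of them.

```python
raw = "Jazz, Pop, R&B / Soul"

# split(",") keeps the space that follows each comma
print(raw.split(","))  # ['Jazz', ' Pop', ' R&B / Soul']

# strip() each piece so ' Pop' and 'Pop' become the same string
cleaned = [genre.strip() for genre in raw.split(",")]
print(cleaned)  # ['Jazz', 'Pop', 'R&B / Soul']

# Why it matters for sets: the padded and clean strings do not match
overlap = set([" Pop"]) & set(["Pop"])
print(overlap)  # set()
```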

**4. Create a function that takes 2 lists and calculates Jaccard distance**

0-60 mph, I know, but you can do this!  Double check our slides, and refer to the set operations for how to calculate this.  

Here is some boilerplate to get you going.

In [46]:
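def jaccard(list1, list2):
    # Items the two lists share, and all distinct items across both
    intersection = set(list1).intersection(set(list2))
    union = set(list1).union(set(list2))
    # Note: this ratio is the Jaccard similarity (1.0 = identical sets);
    # the Jaccard distance is 1 minus this value.
    jdist = len(intersection)/float(len(union))
    return jdist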
    
list1 = ['blue', 'green', 'yellow']
list2 = ['black', 'orange', 'yellow', 'green']

jaccard(list1, list2)

0.4
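
A note on terminology: the ratio computed above, the size of the intersection divided by the size of the union, is usually called the Jaccard similarity (or Jaccard index), and the Jaccard distance is 1 minus that value. Here the two lists share 2 of 5 distinct items, so the similarity is 0.4 and the distance is 0.6. The sketch below separates the two; the function names and the empty-input convention are assumptions for illustration, not part of the lab's boilerplate.

```python
def jaccard_similarity(list1, list2):
    """Share of distinct items that the two lists have in common."""
    set1, set2 = set(list1), set(list2)
    union = set1 | set2
    if not union:  # both lists empty: treated as identical (an assumption)
        return 1.0
    return len(set1 & set2) / float(len(union))


def jaccard_distance(list1, list2):
    """1 minus the similarity, so identical lists are at distance 0."""
    return 1.0 - jaccard_similarity(list1, list2)


list1 = ['blue', 'green', 'yellow']
list2 = ['black', 'orange', 'yellow', 'green']

print(jaccard_similarity(list1, list2))  # 0.4
print(jaccard_distance(list1, list2))    # 0.6
```

With the distance form, a person compared with themselves scores 0 rather than 1, and smaller numbers mean more similar tastes.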

**5.  Now for our final trick, calculate the distance between your genre preferences and everyone else's.**

Loop through everyone in the dataframe, create a list out of their "genre" string, echo out their name, then finally the distance between you and their sets.

In [58]:
jdistlist = {}
for name in df['Name']:
    # Split this person's genres string the same way as listsongs
    genrelist = df['genres'][df['Name'] == name].iloc[0].split(",")
    jdist = jaccard(listsongs, genrelist)
    jdistlist[name] = jdist
    print("Jaccard Distance between %s and me is %f" % (name, jdist))
    

   

Jaccard Distance between Dave and me is 0.000000
Jaccard Distance between Brian Zhou and me is 0.000000
Jaccard Distance between Colby and me is 0.000000
Jaccard Distance between Porpoises and me is 0.000000
Jaccard Distance between Tam and me is 0.111111
Jaccard Distance between Scully and me is 0.111111
Jaccard Distance between Mike Levine and me is 0.222222
Jaccard Distance between Yomi and me is 1.000000
Jaccard Distance between Esther and me is 0.000000
Jaccard Distance between Eric and me is 0.222222
Jaccard Distance between Lexi and me is 0.222222
Jaccard Distance between Ilma and me is 0.000000
Jaccard Distance between Mike Steiner and me is 0.000000
Jaccard Distance between dexter/falafel and me is 0.100000
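
To actually find the most similar users, the scores can be ranked. The sketch below is one possible approach, using a handful of the values printed above copied into a plain dictionary (`scores`, `me`, and `ranked` are illustrative names). Since `jaccard()` as written returns 1.0 for identical lists, a higher score here means more overlap.

```python
# A few of the scores printed above, keyed by name
scores = {"Dave": 0.0, "Tam": 0.111111, "Mike Levine": 0.222222,
          "Yomi": 1.0, "Eric": 0.222222}
me = "Yomi"

# Drop my own entry, then sort by score, highest first
others = [(name, score) for name, score in scores.items() if name != me]
ranked = sorted(others, key=lambda pair: pair[1], reverse=True)

# Show the three closest matches
for name, score in ranked[:3]:
    print("%s: %f" % (name, score))
```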



**Optional 6. Try calculating the distance on the time of day feature.**

Try to make a new dataframe, for just you vs everyone, using Jaccard and time of day.  Are there any interesting patterns you see?

In [None]:
df
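
As a starting point, here is a hedged sketch of the same comparison on the time-of-day answers. It uses a plain dictionary with four answers copied from the table shown earlier instead of a new dataframe, and it strips whitespace after splitting; `times`, `to_list`, and `me` are illustrative names.

```python
def jaccard(list1, list2):
    # Same ratio as the lab's jaccard(): |intersection| / |union|
    set1, set2 = set(list1), set(list2)
    return len(set1 & set2) / float(len(set1 | set2))


def to_list(answer):
    # Split on commas and trim the padding around each option
    return [part.strip() for part in answer.split(",")]


# Time-of-day answers copied from four rows of the survey table
times = {
    "Dave": "24/7",
    "Colby": "Morning, Night, Special occasions",
    "Yomi": "Afternoon, Night",
    "Esther": "Afternoon, Night, workday",
}

me = "Yomi"
for name, answer in times.items():
    if name != me:
        score = jaccard(to_list(times[me]), to_list(answer))
        print("%s: %.2f" % (name, score))
```

In this small sample, "24/7" shares no literal option with any other answer, so it scores 0 against everyone, which may be worth keeping in mind for the next question.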

**Optional 7. What can you say about the selection of options for genre or time and what they mean?**