<h1>
<center>
Module 2 - First look at tweets
</center>
</h1>
<div class=h1_cell>
<p>
Social media is a hot spot of activity, with people trying to determine what is being said on platforms like Facebook and Twitter. They want to know whether someone is angry, whether someone likes their product, whether information is fake, and whether someone is breaking the rules for posting. In this module I'd like to look at Twitter and see if we can classify a tweet as carrying a hate message. I was able to round up a dataset of roughly 32K tweets in CSV format, i.e., the same format we used to look at passenger names on the Titanic. The tweets are already labeled, just as the Titanic data was.
<p>
Let's bring in the tweets then talk about them. But first, load your library.

</div>

In [0]:
#flush the old directory
!rm -r  'w20_ds_library'

In [0]:
my_github_name = 'FutureDeus'  #replace with your account name

In [0]:
clone_url = f'https://github.com/{my_github_name}/w20_ds_library.git'

In [0]:
#get the latest. I am using the --branch option because I created a release for week 1 called w1.0
!git clone $clone_url --branch w1.0   


Cloning into 'w20_ds_library'...
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects:  25% (4/16

In [0]:
from w20_ds_library import *

In [0]:
%whos

Variable            Type            Data/Info
---------------------------------------------
Callable            CallableMeta    typing.Callable
TypeVar             TypingMeta      typing.TypeVar
clone_url           str             https://github.com/FutureDeus/w20_ds_library.git
cm_accuracy         function        <function cm_accuracy at 0x7f41851db8c8>
dframe              TypeVar         ~pd.core.frame.DataFrame
hello_ds            function        <function hello_ds at 0x7f41851db510>
knn                 function        <function knn at 0x7f41851db7b8>
knn_tester          function        <function knn_tester at 0x7f41851db840>
my_github_name      str             FutureDeus
ordered_distances   function        <function ordered_distances at 0x7f41851db6a8>
pd                  module          <module 'pandas' from '/u<...>ages/pandas/__init__.py'>
sortSecond          function        <function sortSecond at 0x7f41851db730>
w20_ds_library      module          <module 'w20_ds_library.w<...

In [0]:
pd.set_option('display.max_columns', None)

In [0]:
import pandas as pd
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRpZtv1ZFa7Am4j7U8S4JDYTuGkAp3rEyVh7riN8nVLEcaos_wgoAyJiRiE1oe8aITeex8BG-z6Sj5-/pub?output=csv'
tweet_table = pd.read_csv(url) 

Rename columns

In [0]:
tweet_table = tweet_table.rename(columns={"author": "label", "text": "tweet"})

In [0]:
tweet_table.head()  #reminder: label is 1 for hate tweets

Unnamed: 0,label,tweet
0,0,@user when a father is dysfunctional and is so...
1,0,@user @user thanks for #lyft credit i can't us...
2,0,bihday your majesty
3,0,#model i love u take with u all the time in ...
4,0,factsguide: society now #motivation


In [0]:
len(tweet_table)

31962

<div class=h1_cell>
<p>
Now check for duplicates. It looks like we have roughly 2.5K (31962 - 29509 = 2453). Pandas makes it easy to drop them.
</div>

In [0]:
all_tweets = tweet_table['tweet']  #pull out the tweet column
print(len(all_tweets))
print(len(set(all_tweets)))  # set removes duplicates

31962
29509


In [0]:
tweet_table = tweet_table.drop_duplicates('tweet')  #and yet again, pandas has the method we want. Go pandas.
len(tweet_table)

29509

One gotcha from dropping rows is that the indices do not change. You can see this below where I ask to see the last row. It still has index 31961.

In [0]:
tweet_table.iloc[-1:]

Unnamed: 0,label,tweet
31961,0,thank you @user for you follow


The straightforward way around it is to reset the index so that it starts at 0 and goes up by 1. Pandas has a slick method to do this. BTW: if you set `drop` to False, pandas will create a new column `index` and place the old values there. We don't care about the old values, so we drop them.

In [0]:
tweet_table = tweet_table.reset_index(drop=True)
tweet_table.iloc[-1:]

Unnamed: 0,label,tweet
29508,0,thank you @user for you follow
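If the index gotcha is hard to see on 30K rows, here is a minimal sketch on a toy table. The rows are made up for illustration; only `drop_duplicates` and `reset_index` come from the cells above.

```python
import pandas as pd

# Toy table (made-up rows) with one duplicated tweet
toy = pd.DataFrame({'label': [0, 1, 0, 0],
                    'tweet': ['hello', 'bad tweet', 'hello', 'bye']})

deduped = toy.drop_duplicates('tweet')       # keeps the first copy of each tweet
print(deduped.index.tolist())                # old labels survive, with a gap where the duplicate was

renumbered = deduped.reset_index(drop=True)  # fresh 0..n-1 index, old labels discarded
print(renumbered.index.tolist())
```

The gap matters later if you mix label-based access (`loc`) with position-based access (`iloc`), which is why I reset right away.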



<div class=h1_cell>
<p>
You can see we have roughly 30K tweets/rows. Each row has a tweet and a label. The label is 1 if a human thought it was a hate-tweet and 0 if not. Our goal is to build a model that will predict whether a tweet should be labeled as hateful.
<p>


Now let's lowercase the text in all the tweets.

In [0]:
tweet_table['tweet'] = tweet_table.apply(lambda r: r['tweet'].lower(), axis=1)

<h2>How are the 2 classes distributed?</h2>

It is useful to know what percentage of the 30K tweets are hateful.

In [0]:
tweet_table['label'].value_counts()

0    27500
1     2009
Name: label, dtype: int64

Uh, that is a low percentage: only about 7% of the tweets (2009 of 29509) are labeled hateful. So if we always guess that a tweet is non-hateful, we will be correct about 93% of the time. If we cannot do better than 93% accuracy, then what's the point? Keep that in mind for later. For this week, we will wrangle our data to have an even number of hate and nohate tweets. But next week we will go with the raw distribution.
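To make the baseline concrete, here is a small sketch that turns the two counts into proportions. The counts are copied from the `value_counts()` output above; the variable names are just for illustration.

```python
# Counts copied from the value_counts() output above
counts = {0: 27500, 1: 2009}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a "model" that always guesses the majority class
baseline_accuracy = max(proportions.values())

print({label: round(p, 3) for label, p in proportions.items()})
print(round(baseline_accuracy, 3))
```

On the real table, `tweet_table['label'].value_counts(normalize=True)` should give the same proportions in one call.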

The tweets have a mixture of normal chars and emojis. The emojis are complex because a single emoji can be encoded as several bytes, e.g., 4 sequential bytes can be one emoji. You can see below that the print function treats these bytes differently: it attempts to render each 4-byte group as one symbol. When I look at the raw tweet, I can see the individual bytes, e.g., \x9f is the byte containing hex 9f, etc.

</div>

In [0]:
row3_raw = str(tweet_table.loc[3,'tweet'])
row3_raw

'#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦'

In [0]:
print(row3_raw)  # you can see the print function can do some parsing of those hex codes

#model   i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦


<h2>
No emojis for now
</h2>
<div class=h1_cell>
<p>
I am going to put off trying to deal with emojis. I think it is a bit of a side-track at this point. But we may want to return to it later. I could see that emojis are a kind of "emotional language" that might be able to signal intent.
<p>
If you are interested, below are the emoji codes that I found for row3_raw using 4-byte sequences. But a big caveat: one of my grad students believes that the actual emoji is a phone that stacks more than 4 bytes.
</div>

\xf0\x9f\x93\xb1 1F4F1	MOBILE PHONE    1F300-1F5FF
 
\xf0\x9f\x98\x99	http://www.unicode.org/Public/UNIDATA/NamesList.txt KISSING FACE WITH SMILING EYES
 
\xf0\x9f\x98\x8e 1F60E	SMILING FACE WITH SUNGLASSES
 
\xf0\x9f\x91\x84  https://apps.timwhitlock.info/unicode/inspect/hex/1F444  1F300-1F5FF
 
\xf0\x9f\x91\x85   https://apps.timwhitlock.info/unicode/inspect/hex/1F445   1F300-1F5FF

\xf0\x9f\x92\xa6  https://apps.timwhitlock.info/unicode/inspect/hex/1F4A6     1F300-1F5FF
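If you want to check a code point yourself, here is a minimal sketch. It rests on one assumption: that each odd-looking character in the raw tweet is really a single UTF-8 byte that got read as Latin-1. If that holds, re-encoding as Latin-1 and decoding as UTF-8 recovers the emoji.

```python
import unicodedata

# The first garbled emoji from row3_raw, written with escapes so it survives copy/paste
garbled = '\xf0\x9f\x93\xb1'

# Assumption: each character is really one UTF-8 byte that was read as Latin-1
repaired = garbled.encode('latin-1').decode('utf-8')

print(len(garbled), len(repaired))   # 4 characters collapse into 1
print(hex(ord(repaired)))            # the code point to look up
print(unicodedata.name(repaired))    # the official Unicode name
```

For this sequence the result is U+1F4F1, so the phone reading looks right, and it fits in 4 bytes. Tweets containing characters outside Latin-1 would raise an error on the `encode` step, so treat this as a spot check rather than a cleaning routine.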

# Challenge 1

I'd like to see if we can carry over our ideas of using KNN and character counts from the Titanic. Instead of passenger names, we now have tweets. But similar idea. To start, add a column for the length of each tweet.

In [0]:
#your code here
tweet_table['length'] = tweet_table.apply(lambda row: len(row['tweet']), axis=1)

In [0]:
tweet_table.head()

Unnamed: 0,label,tweet,length
0,0,@user when a father is dysfunctional and is so...,101
1,0,@user @user thanks for #lyft credit i can't us...,122
2,0,bihday your majesty,19
3,0,#model i love u take with u all the time in ...,84
4,0,factsguide: society now #motivation,38


# Challenge 2

Add a column for a count of the hashtags in each tweet. Why? Mostly just a guess they might be useful.

In [0]:
#your code here
def hash_counter(row_tweet):
  n = 0
  for i in row_tweet:
    if i == '#':
      n += 1
  return n

tweet_table['#'] = tweet_table.apply(lambda row: hash_counter(row['tweet']), axis=1)

In [0]:
tweet_table.head()

Unnamed: 0,label,tweet,length,#
0,0,@user when a father is dysfunctional and is so...,101,1
1,0,@user @user thanks for #lyft credit i can't us...,122,3
2,0,bihday your majesty,19,0
3,0,#model i love u take with u all the time in ...,84,1
4,0,factsguide: society now #motivation,38,1


# Challenge 3

Add a column for a count of the exclamation points in each tweet. Why? Ditto.

In [0]:
#your code here
def exclamation_counter(row_tweet):
  n = 0
  for i in row_tweet:
    if i == '!':
      n += 1
  return n

tweet_table['!'] = tweet_table.apply(lambda row: exclamation_counter(row['tweet']), axis=1)

In [0]:
tweet_table.head()

Unnamed: 0,label,tweet,length,#,!
0,0,@user when a father is dysfunctional and is so...,101,1,0
1,0,@user @user thanks for #lyft credit i can't us...,122,3,0
2,0,bihday your majesty,19,0,0
3,0,#model i love u take with u all the time in ...,84,1,3
4,0,factsguide: society now #motivation,38,1,0
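The loops above are fine for practice. As a side note, pandas string methods can produce the same three columns without `apply`. This is a sketch on a made-up toy table, not a replacement for the challenge code.

```python
import pandas as pd

# Toy tweets (made up) standing in for tweet_table['tweet']
toy = pd.DataFrame({'tweet': ['#a #b wow!', 'plain text', 'so good!!! #yes']})

toy['length'] = toy['tweet'].str.len()   # characters per tweet
toy['#'] = toy['tweet'].str.count('#')   # hashtags per tweet
toy['!'] = toy['tweet'].str.count('!')   # exclamation points per tweet
print(toy)
```

One thing to watch: `str.count` treats its argument as a regular expression, so characters such as `.` or `?` would need escaping. `#` and `!` are safe as written.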


# Challenge 4

Create columns for the 26 letters and then fill them with the correct counts. You should be able to use the Titanic code pretty much as is, other than changing a few values.

In [0]:
#your loop here
alphabet = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']

def mein_counter(row_name, cur_alphabet):
  n = 0
  for i in row_name.lower():
    if i == cur_alphabet:
      n += 1
  return n

for i in alphabet:
  tweet_table[i] = tweet_table.apply(lambda row: mein_counter(row['tweet'], i), axis=1)


In [0]:
tweet_table.head()

Unnamed: 0,label,tweet,length,#,!,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
0,0,@user when a father is dysfunctional and is so...,101,1,0,5,0,2,5,5,4,1,6,9,0,1,2,0,8,4,0,0,4,12,4,4,0,1,0,2,0
1,0,@user @user thanks for #lyft credit i can't us...,122,3,0,7,0,4,6,12,4,1,5,6,0,2,2,0,7,4,2,0,6,7,9,4,1,1,1,2,0
2,0,bihday your majesty,19,0,0,2,1,0,1,1,0,0,1,1,1,0,0,1,0,1,0,0,1,1,1,1,0,0,0,3,0
3,0,#model i love u take with u all the time in ...,84,1,3,2,0,0,1,5,0,0,2,4,0,1,4,2,1,2,0,0,1,0,4,3,1,1,0,0,0
4,0,factsguide: society now #motivation,38,1,0,2,0,2,1,2,1,1,0,4,0,0,0,1,2,4,0,0,0,2,4,1,1,1,0,1,0


Here is my result:

<img src='https://www.dropbox.com/s/e5or1kcyyt4e30g/Screenshot%202020-01-02%2008.44.13.png?raw=1'>
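The nested loop scans each tweet 26 times, once per letter. A possible alternative, sketched below, tallies every character in a single pass with `collections.Counter` and then reads off the 26 letters. The helper name `letter_counts` is my own and is not part of the library.

```python
from collections import Counter
import string

def letter_counts(text):
    """Return a list of 26 counts, one per letter a-z, for the given text."""
    tally = Counter(text.lower())                        # one pass over the characters
    return [tally[ch] for ch in string.ascii_lowercase]  # Counter gives 0 for missing letters

counts = letter_counts('bihday your majesty')
print(dict(zip(string.ascii_lowercase, counts)))
```

The counts for this tweet should match row 2 of the table above, e.g., 2 for `a` and 3 for `y`.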

<h2>
Challenge 5 - build new KNN distance measure
</h2>
<div class=h1_cell>
<p>
We used Euclidean Distance in week 1. For text analytics there is a preferred measure: cosine similarity. The general idea is that we view an n-vector as a line from the origin in n-space. Given 2 such vectors, their cosine similarity is the cosine of the angle between them, so it depends on the direction of the vectors and not on their length. This is from a (free!) online book that gives some good examples: see page 121 here https://nlp.stanford.edu/IR-book/pdf/06vect.pdf. We won't really need the cosine method until later but might as well start practicing with it now.
<p>
As a side note, for NLP problems there is often similarity between features. For instance, we expect 'q' and 'u' to co-occur. In this case, we might want something slightly more complicated: the soft cosine measure (https://en.wikipedia.org/wiki/Cosine_similarity#Soft_cosine_measure). We may get back to this if time permits.
<p>
Implement the function. And no libraries for this, please. I want raw Python code.
<p>
<img src='https://www.dropbox.com/s/oi1ttx99hf0uejn/cosine.png?raw=1'>
</div>

In [0]:
def cosine_similarity(vect1:list ,vect2:list) -> float:
  assert isinstance(vect1, list), f'vect1 is not a list but a {type(vect1)}'
  assert isinstance(vect2, list), f'vect2 is not a list but a {type(vect2)}'
  assert len(vect1) == len(vect2), f"Mismatching length for vectors: {len(vect1)} and {len(vect2)}"
  
  #your code here
  top, bot_a, bot_b = 0, 0, 0
  for i in range(len(vect1)):
    top += vect1[i]*vect2[i]
    bot_a += vect1[i]**2
    bot_b += vect2[i]**2
  return (top/(((bot_a)**(1/2))*((bot_b)**(1/2))))

In [0]:
row0 = tweet_table.iloc[0]
row1 = tweet_table.iloc[1]

In [0]:
vector0 = row0.drop(['label', 'tweet'], axis=0).tolist()
vector1 = row1.drop(['label', 'tweet'], axis=0).tolist()
print((vector0,vector1))  #([101, 1, 0, 5, 0, 2, 5, 5, 4, 1, 6, 9, 0, 1, 2, 0, 8, 4, 0, 0, 4, 12, 4, 4, 0, 1, 0, 2, 0], [122, 3, 0, 7, 0, 4, 6, 12, 4, 1, 5, 6, 0, 2, 2, 0, 7, 4, 2, 0, 6, 7, 9, 4, 1, 1, 1, 2, 0])

([101, 1, 0, 5, 0, 2, 5, 5, 4, 1, 6, 9, 0, 1, 2, 0, 8, 4, 0, 0, 4, 12, 4, 4, 0, 1, 0, 2, 0], [122, 3, 0, 7, 0, 4, 6, 12, 4, 1, 5, 6, 0, 2, 2, 0, 7, 4, 2, 0, 6, 7, 9, 4, 1, 1, 1, 2, 0])


In [0]:
cosine_similarity(vector0, vector1)  #0.9947943119681618

0.9947943119681618
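A similarity of 0.9947 looks suspiciously high for two unrelated tweets. One likely reason is that the `length` entry (101 and 122) is much bigger than any letter count, so both vectors point in nearly the same direction. The sketch below illustrates the effect with made-up vectors; `cos_sim` is a hypothetical compact version of the same formula.

```python
import math

def cos_sim(v1, v2):
    # Hypothetical helper: same formula as the challenge, written compactly
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

# Scaling a vector does not change its direction, so the similarity stays at (about) 1
print(cos_sim([1, 2, 3], [2, 4, 6]))

# Two made-up vectors that disagree completely on their small counts...
small_a, small_b = [5, 0, 5, 0], [0, 5, 0, 5]
print(cos_sim(small_a, small_b))                 # 0.0: nothing in common

# ...look almost identical once a large length-style value leads each vector
print(cos_sim([100] + small_a, [100] + small_b))
```

Keep this in mind when reading the KNN results later: the length column may be doing most of the talking.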

## Bit of a problem

For Euclidean Distance, the smaller the value the better. So we want to order from smallest to largest.

For Cosine Similarity, it is the reverse. A value of 1 (the max value possible) is the best. A value of 0 is the worst for count vectors like ours, which have no negative entries. To keep using our ordering of smallest to largest, I am going to invert the cosine function by subtracting its value from 1 (this is often called cosine distance). This should give us 0 as the best match and 1 as the worst match.

In [0]:
def inverse_cosine_similarity(vect1:list ,vect2:list) -> float:
  assert isinstance(vect1, list), f'vect1 is not a list but a {type(vect1)}'
  assert isinstance(vect2, list), f'vect2 is not a list but a {type(vect2)}'
  assert len(vect1) == len(vect2), f"Mismatching length for vectors: {len(vect1)} and {len(vect2)}"

  normal_result = cosine_similarity(vect1, vect2)
  return 1.0 - normal_result

In [0]:
inverse_cosine_similarity(vector0, vector1)  #0.005205688031838185, i.e., 1 minus the similarity above

0.005205688031838185

In [0]:
row2 = tweet_table.iloc[2]

In [0]:
vector0 = row0.drop(['label', 'tweet'], axis=0).tolist()
vector2 = row2.drop(['label', 'tweet'], axis=0).tolist()
print((vector0,vector2))

([101, 1, 0, 5, 0, 2, 5, 5, 4, 1, 6, 9, 0, 1, 2, 0, 8, 4, 0, 0, 4, 12, 4, 4, 0, 1, 0, 2, 0], [19, 0, 0, 2, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 3, 0])


In [0]:
inverse_cosine_similarity(vector0, vector2)

0.02165567579392158

## I admit I am being cruel

I am making you program the function from scratch. I think it is good practice!

But if we want to use fancy packages, we can do it in one line:
<pre>
dot(vector0, vector2)/(norm(vector0)*norm(vector2))
</pre>
We can use that as an oracle to check against our own function result, which is `0.02165567579392158`.

In [0]:
#oracle
import math
from scipy import spatial
from numpy import dot
from numpy.linalg import norm


In [0]:
#Simplest
spatial.distance.cosine(vector0, vector2)  #note it does inverse for us

0.02165567579392169

In [0]:
#A bit more involved
1.0 - dot(vector0, vector2)/(norm(vector0)*norm(vector2))  #note I am inverting the result

0.02165567579392158
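Notice that the scipy value differs from ours in the last few digits. That is ordinary floating-point rounding from doing the arithmetic in a different order. When checking against an oracle, it is safer to compare with a tolerance than with `==`. A small sketch using the two values printed above:

```python
import math

ours = 0.02165567579392158          # inverse_cosine_similarity(vector0, vector2)
scipy_value = 0.02165567579392169   # spatial.distance.cosine(vector0, vector2)

print(ours == scipy_value)                            # exact comparison fails on the last digits
print(math.isclose(ours, scipy_value, rel_tol=1e-9))  # tolerant comparison succeeds
```

The tolerance of `1e-9` is an arbitrary choice here; anything far above rounding noise and far below a real difference would do.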

<h2>
Challenge 6 - run K-NN on our data and see how we do
</h2>
<div class=h1_cell>
<p>
We will break the larger table into a testing table and a crowd table. But the larger table is so extreme in its distribution, roughly 93% non-hate, that I want to try an experiment. What if haters and non-haters were equally divided? How well would KNN work then? Spoiler alert: not very well. Let's check it out.
</div>

Here's my plan of attack:

1. I'll shuffle the big tweet_table randomly.
2. I want a testing table that is 300 hate tweets and 300 non-hate tweets.
3. I want a crowd table that is 300 hate tweets and 300 non-hate tweets.
4. I want to try my new cosine distance function out on these.

Let's shuffle the table.

In [0]:
import numpy as np
rsgen = np.random.RandomState(1234)

In [0]:
#fill in the code

shuffled_table = tweet_table.sample(n=len(tweet_table), random_state=rsgen)

See this for a way to even out training and testing size using fancier packages: https://medium.com/vickdata/detecting-hate-speech-in-tweets-natural-language-processing-in-python-for-beginners-4e591952223

Here is a helper function.

In [0]:
def filter_by_column_value(df, col, value):
  filtered_table = df.loc[df[col] == value]
  return filtered_table

In [0]:
hate_table = filter_by_column_value(shuffled_table, 'label', 1)
hate_table.head()

Unnamed: 0,label,tweet,length,#,!,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
20773,1,thought factory: left-right polarisation! #tru...,104,6,1,4,2,3,1,6,2,2,4,8,0,0,6,2,2,6,4,0,6,5,10,3,0,0,1,1,0
3012,1,@user check this out! i mentioned this org to ...,109,0,1,5,1,4,2,7,2,2,4,8,0,2,2,2,6,7,3,0,6,4,5,4,0,1,0,4,0
6739,1,i can't believe that they don't show blazing s...,120,0,0,12,4,4,4,10,0,1,5,4,1,1,5,2,6,5,0,0,2,7,11,2,2,1,0,2,1
22898,1,if you're mad about @user and their interracia...,119,3,0,8,2,6,2,13,3,1,3,6,0,2,5,1,3,6,4,0,10,3,7,7,0,2,0,2,0
26322,1,princess eugenie caught up in 'racism' row ove...,80,0,0,5,0,5,0,8,0,3,3,7,0,0,1,1,5,3,4,0,5,6,3,3,1,2,0,1,0


In [0]:
nohate_table = filter_by_column_value(shuffled_table, 'label', 0)
nohate_table.head()

Unnamed: 0,label,tweet,length,#,!,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
7740,0,happy weekend! ð#quoteoftheday #quote #mind...,71,5,1,2,1,0,3,9,2,0,2,1,1,1,1,3,3,3,2,2,1,4,4,5,0,1,0,2,0
24467,0,i have 1 homegirl ðð,26,0,0,1,0,0,0,2,0,1,2,2,0,0,1,1,0,1,0,0,1,0,0,0,1,0,0,0,0
12031,0,@user pray for my country. 2days flooding. g...,96,4,0,11,0,3,2,3,4,3,3,4,0,0,1,1,5,4,2,0,6,5,3,3,0,2,0,4,0
2606,0,can't believe i was up through the 3hrs of #or...,136,2,0,4,1,2,2,11,2,3,6,4,0,0,5,0,10,10,2,0,6,6,11,3,2,6,0,1,0
2997,0,great way to finish a great week off. #workupa...,102,6,0,7,0,2,0,11,6,2,4,6,0,2,3,1,2,4,1,0,4,7,8,2,0,4,0,3,0


In [0]:
testing_table = pd.concat([hate_table[:300], nohate_table[:300]])  #stack first 300 rows from each table (DataFrame.append was removed in pandas 2.0)
testing_table = testing_table.reset_index(drop=True)
testing_table = testing_table.drop(['tweet'], axis=1)

In [0]:
testing_table.head()

Unnamed: 0,label,length,#,!,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
0,1,104,6,1,4,2,3,1,6,2,2,4,8,0,0,6,2,2,6,4,0,6,5,10,3,0,0,1,1,0
1,1,109,0,1,5,1,4,2,7,2,2,4,8,0,2,2,2,6,7,3,0,6,4,5,4,0,1,0,4,0
2,1,120,0,0,12,4,4,4,10,0,1,5,4,1,1,5,2,6,5,0,0,2,7,11,2,2,1,0,2,1
3,1,119,3,0,8,2,6,2,13,3,1,3,6,0,2,5,1,3,6,4,0,10,3,7,7,0,2,0,2,0
4,1,80,0,0,5,0,5,0,8,0,3,3,7,0,0,1,1,5,3,4,0,5,6,3,3,1,2,0,1,0


Here is my result:

<img src='https://www.dropbox.com/s/6tgchc43lflwxpc/Screenshot%202020-01-02%2014.33.53.png?raw=1'>

In [0]:
crowd_table = pd.concat([hate_table[300:600], nohate_table[300:600]])  #stack next 300 rows from each table (DataFrame.append was removed in pandas 2.0)
crowd_table = crowd_table.reset_index(drop=True)
crowd_table = crowd_table.drop(['tweet'], axis=1)

In [0]:
crowd_table.head()

Unnamed: 0,label,length,#,!,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
0,1,78,1,0,6,1,2,2,8,2,2,3,6,0,0,2,0,3,6,0,0,4,4,4,6,1,3,0,4,0
1,1,106,2,0,9,2,1,2,10,0,1,6,4,0,0,5,2,4,6,0,0,5,7,9,2,0,2,0,2,0
2,1,121,0,0,6,1,3,7,7,2,5,3,7,0,0,0,2,8,9,5,0,6,6,2,7,2,4,0,3,0
3,1,108,0,0,8,3,3,2,11,2,1,0,4,0,1,3,3,5,6,2,0,7,9,2,8,0,0,0,2,0
4,1,63,7,1,5,2,1,2,4,0,0,1,4,0,1,2,2,2,2,4,0,4,2,4,3,0,3,0,1,1


Here is my result:

<img src='https://www.dropbox.com/s/6x2h1tx6xm5ltyx/Screenshot%202020-01-02%2014.35.04.png?raw=1'>
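The testing and crowd tables are built the same way, just with different slice bounds. If you find yourself repeating this, one option is a small helper like the sketch below. The name `balanced_slice` and the toy table are my own inventions for illustration.

```python
import pandas as pd

def balanced_slice(df, label_col, start, stop):
    """Take rows start:stop from each class and stack them into one table."""
    parts = [df.loc[df[label_col] == value].iloc[start:stop]
             for value in sorted(df[label_col].unique(), reverse=True)]  # label 1 first, as above
    return pd.concat(parts).reset_index(drop=True)

# Toy shuffled table (made up): 4 hate rows and 6 non-hate rows
toy = pd.DataFrame({'label': [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
                    'length': range(10)})

testing = balanced_slice(toy, 'label', 0, 2)   # first 2 rows of each class
crowd = balanced_slice(toy, 'label', 2, 4)     # next 2 rows of each class
print(testing['label'].tolist(), crowd['label'].tolist())
```

Because the two calls use non-overlapping slices, no row can land in both tables.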

<h3>Let's do a small test</h3>

In [0]:
target_row = testing_table.iloc[0]
target_row['label']  #1

1

In [0]:
target_vector = target_row.drop(['label'], axis=0).tolist()
print(target_vector)  #[104, 6, 1, 4, 2, 3, 1, 6, 2, 2, 4, 8, 0, 0, 6, 2, 2, 6, 4, 0, 6, 5, 10, 3, 0, 0, 1, 1, 0]

[104, 6, 1, 4, 2, 3, 1, 6, 2, 2, 4, 8, 0, 0, 6, 2, 2, 6, 4, 0, 6, 5, 10, 3, 0, 0, 1, 1, 0]


<div class=h1_cell>

Start with just checking row0.
<p>
</div>

In [0]:

top5_rows = ordered_distances(target_vector, crowd_table, 'label', inverse_cosine_similarity)[:5]  # passing in distance function


In [0]:
top5_rows  #[(161, 1.7724473375668737e-06), (278, 1.7724473375668737e-06), (140, 7.215183497544242e-05), (56, 0.0028838511132895928), (191, 0.0029398844406477442)]

[(161, 1.7724473375668737e-06),
 (278, 1.7724473375668737e-06),
 (140, 7.215183497544242e-05),
 (56, 0.0028838511132895928),
 (191, 0.0029398844406477442)]

In [0]:
knn(target_vector, crowd_table, 'label', 5, inverse_cosine_similarity)  # 1 (the correct label)

1
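As a reminder of what happens between `ordered_distances` and the final answer, here is a rough sketch of the voting step. It is not the library's `knn` code, which may differ; `vote`, the neighbors, and the labels are all made up, and only the `(index, distance)` shape is taken from `top5_rows` above.

```python
from collections import Counter

def vote(neighbors, crowd_labels):
    """Majority label among neighbors, given as (row_index, distance) pairs."""
    labels = [crowd_labels[index] for index, _distance in neighbors]
    return Counter(labels).most_common(1)[0][0]

# Made-up neighbors in the same (index, distance) shape as top5_rows, closest first
neighbors = [(2, 0.001), (0, 0.002), (4, 0.004), (1, 0.010), (3, 0.020)]
crowd_labels = [1, 0, 0, 1, 1]   # hypothetical labels for crowd rows 0..4

print(vote(neighbors[:1], crowd_labels))   # k=1 uses only the closest row
print(vote(neighbors, crowd_labels))       # k=5 lets all five rows vote
```

In this toy case the closest row alone says 0 but the crowd of five says 1, which is why the choice of k can change the prediction. With two classes, an odd k avoids ties.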

Try k=1 and check out what you see. Warning: this is taking me about 3 minutes a pop.

In [0]:
%%time
cdict = knn_tester(testing_table, crowd_table, 'label', 1, inverse_cosine_similarity)  #crowd size of 1

print(cdict) #{(0, 0): 150, (0, 1): 112, (1, 0): 150, (1, 1): 188}

{(0, 0): 155, (0, 1): 96, (1, 0): 145, (1, 1): 204}
CPU times: user 1min 50s, sys: 18.5 ms, total: 1min 50s
Wall time: 1min 50s


In [0]:
cm_accuracy(cdict)  #0.5633333333333334

0.5983333333333334
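It can help to see where that accuracy comes from. The sketch below recomputes it from the dictionary printed above. I am assuming each key is `(predicted, actual)`; that reading is consistent with each actual class adding up to 300, but check it against your own `knn_tester`.

```python
# Confusion counts copied from the k=1 run above
cdict = {(0, 0): 155, (0, 1): 96, (1, 0): 145, (1, 1): 204}

# Assumption: each key is (predicted, actual)
total = sum(cdict.values())
accuracy = (cdict[(0, 0)] + cdict[(1, 1)]) / total   # correct predictions over all predictions

actual_hate = cdict[(0, 1)] + cdict[(1, 1)]
actual_nohate = cdict[(0, 0)] + cdict[(1, 0)]
hate_recall = cdict[(1, 1)] / actual_hate        # share of hate tweets we caught
nohate_recall = cdict[(0, 0)] / actual_nohate    # share of non-hate tweets we cleared

print(round(accuracy, 4), round(hate_recall, 2), round(nohate_recall, 2))
```

Under that assumption, the per-class numbers show the model does a bit better on hate tweets than on non-hate tweets, something the single accuracy figure hides.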

That's not good. We can get to .5 with just a coin flip. 
<p>
Let's try for k=5.

In [0]:

%%time
cdict = knn_tester(testing_table, crowd_table, 'label', 5, inverse_cosine_similarity)  #crowd size of 5

print(cdict) #{(0, 0): 146, (0, 1): 85, (1, 0): 154, (1, 1): 215}

{(0, 0): 147, (0, 1): 76, (1, 0): 153, (1, 1): 224}
CPU times: user 1min 52s, sys: 23.9 ms, total: 1min 52s
Wall time: 1min 52s


In [0]:
cm_accuracy(cdict)  #0.6016666666666667

0.6183333333333333

One more with k = 11.

In [0]:
%%time
cdict = knn_tester(testing_table, crowd_table, 'label', 11, inverse_cosine_similarity)  #crowd size of 11

print(cdict) #{(0, 0): 153, (0, 1): 86, (1, 0): 147, (1, 1): 214}

{(0, 0): 145, (0, 1): 50, (1, 0): 155, (1, 1): 250}
CPU times: user 1min 50s, sys: 14 ms, total: 1min 50s
Wall time: 1min 50s


In [0]:
cm_accuracy(cdict)  #0.6116666666666667

0.6583333333333333

<h2>
KNN a loser?
</h2>
<div class=h1_cell>
<p>
It certainly looks that way given the way I have set up the problem. If we assume that haters and non-haters are in equal proportion, then the columns we have are not doing a good job.  If you are asking your boss for a raise, you need to do better than that.

And I think we can do better with more meaningful columns. Just looking at character counts seems nonsensical. We can look at the words in the tweets instead of the characters - that should be much better.

Is it worth trying to reengineer the columns to salvage KNN? On the plus side, there is no training necessary with KNN. We have a crowd and that's what we use. But on the negative side, it is slow. Every time we want a prediction, we have to run our distance measure against every row in the crowd table. We will look at another technique called Naive Bayes, whose training step is little more than counting and which is also fairly fast at prediction time.
</div>
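Part of the slowness comes from looping in raw Python. For the curious, here is a sketch of how the same cosine distance could be computed against a whole crowd at once with numpy. The function name and the toy crowd are made up; this is an illustration of the idea, not a change to the library.

```python
import numpy as np

def cosine_distances_to_crowd(target, crowd):
    """Return 1 - cosine similarity between target and every row of crowd."""
    target = np.asarray(target, dtype=float)
    crowd = np.asarray(crowd, dtype=float)
    dots = crowd @ target                                             # one dot product per crowd row
    norms = np.linalg.norm(crowd, axis=1) * np.linalg.norm(target)    # product of vector lengths per row
    return 1.0 - dots / norms

# Toy crowd (made up): same direction, 45 degrees off, and perpendicular to the target
crowd = [[2, 0], [1, 1], [0, 3]]
distances = cosine_distances_to_crowd([1, 0], crowd)
print(distances)
print(np.argsort(distances)[:2])   # indices of the 2 closest rows
```

This still compares the target to every crowd row, so the cost grows with the crowd size. It only moves the loop out of Python.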

<h2>Write out wrangled table</h2>

Let's store our table so we don't have to redo the wrangling later.

Note I am using a folder I created called class_tables. You can put your own folder name in, or just omit the folder and write directly to My Drive. Up to you.

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
with open('/content/gdrive/My Drive/class_tables/tweets_shuffled.csv', 'w') as f:
  shuffled_table.to_csv(f, encoding='utf-8', index=False)

## Get url

We will be using the table in the next module. We could read the table in directly off Google Drive. I show you how to do that below. But it is better to get the url of the file and use it. Then we do not have to mount the drive. I'll show you how in the next module.

In [0]:
#This will read it back in but there is a better way
'''
with open('/content/gdrive/My Drive/class_tables/tweets_shuffled.csv', 'r') as f:
  shuffled_table = pd.read_csv(f)
'''

"\nwith open('/content/gdrive/My Drive/class_tables/tweets_shuffled.csv', 'r') as f:\n  shuffled_table = pd.read_csv(f)\n"

<h2>
Changes to your library
</h2>
<div class=h1_cell>
<p>
We defined several new functions (`cosine_similarity`, `inverse_cosine_similarity`, and `filter_by_column_value`), so add them to your GitHub library.
</div>