# Random Friday


Here are a few things that may be helpful. (Some of them took a bit of time to figure out, so hopefully this will save you some work.)

First, let me create a sample dataframe:


In [5]:
import pandas as pd
import numpy as np
from pandas import DataFrame

names = ['Ann', 'Becky', 'Clara', 'Dan', 'Lila', 'Enric']
status = ['in state', 'in state', 'out state', 'in state', 'out state', 'out state']
schoolClass = ['senior', 'freshman', 'junior', 'freshman', 'junior', 'senior']

students = DataFrame({'name': names, 'status': status, 'schoolClass':  schoolClass})
students



Unnamed: 0,name,schoolClass,status
0,Ann,senior,in state
1,Becky,freshman,in state
2,Clara,junior,out state
3,Dan,freshman,in state
4,Lila,junior,out state
5,Enric,senior,out state


## creating new columns that are conditional on other columns

Suppose I want to make the `schoolClass` and `status` columns numeric so I can do kNN on the dataset.  The status column is easy and we can use `np.where`. The `schoolClass` column requires that we write a little function. Here is how to do both (here I am keeping the original columns and just adding 2 more):

In [11]:
students['in state'] = np.where(students['status']=='in state', 1, 0)

# now create a lambda function for the School Class conversion
categorize = lambda value: [1,2,3,4][['freshman', 'sophomore', 'junior', 'senior'].index(value)]

students['year'] = students['schoolClass'].apply(categorize)
students

Unnamed: 0,name,schoolClass,status,in state,year
0,Ann,senior,in state,1,4
1,Becky,freshman,in state,1,1
2,Clara,junior,out state,0,3
3,Dan,freshman,in state,1,1
4,Lila,junior,out state,0,3
5,Enric,senior,out state,0,4
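
As an alternative to the lambda, a dictionary passed to `Series.map` does the same conversion and is a little easier to read. This is a minimal sketch on a small stand-in frame; the names `students_demo` and `class_to_year` are mine, not part of the notebook above. One difference worth knowing: `map` yields `NaN` for a value that is missing from the dictionary, whereas the list `.index` approach raises a `ValueError`.

```python
import pandas as pd

# Small stand-in for the students frame (same column name as above)
students_demo = pd.DataFrame({'schoolClass': ['senior', 'freshman', 'junior']})

# Hypothetical lookup table: class name -> numeric year
class_to_year = {'freshman': 1, 'sophomore': 2, 'junior': 3, 'senior': 4}

students_demo['year'] = students_demo['schoolClass'].map(class_to_year)
print(students_demo)
```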


## the minmax scaler

In the intro to sklearn notebook we saw how to scale values:

In [83]:
simple = DataFrame({'age': [26, 67, 28], 'salary': [80000, 115000, 100000]}, index=['Mr. Cool', 'Old Dude', 'Ann'])
simple

Unnamed: 0,age,salary
Mr. Cool,26,80000
Old Dude,67,115000
Ann,28,100000


Now let's use the minmax scaler:

In [84]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
simple[['age', 'salary']] =  scaler.fit_transform(simple[['age', 'salary']] )
simple

Unnamed: 0,age,salary
Mr. Cool,0.0,0.0
Old Dude,1.0,1.0
Ann,0.04878,0.571429
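
With the default feature range of 0 to 1, the scaler computes `(x - min) / (max - min)` for each column. The sketch below redoes that arithmetic by hand for Ann's row, using the column minimums and maximums from the original table; the helper name `min_max` is just for illustration.

```python
def min_max(x, col_min, col_max):
    """Scale x into [0, 1] given the column's minimum and maximum."""
    return (x - col_min) / (col_max - col_min)

# Ann: age 28, salary 100000
ann_age = min_max(28, 26, 67)
ann_salary = min_max(100000, 80000, 115000)

print(round(ann_age, 5), round(ann_salary, 6))  # 0.04878 0.571429
```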


Suppose these people are members of the House of Representatives. Here's how they voted on the "Treat Everyone with Compassion" act:

In [85]:
simple_labels = DataFrame({'vote': [1, 0, 1]}, index=['Mr. Cool', 'Old Dude', 'Ann'])
simple_labels

Unnamed: 0,vote
Mr. Cool,1
Old Dude,0
Ann,1


Let's build a classifier


In [86]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(simple, simple_labels['vote'])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

And now I have a few people I would like to classify as to how they voted:

In [87]:
others = DataFrame({'age': [55, 69, 35], 'salary': [80000, 125000, 95000]}, index=['Rev. Daiho', 'Big Bucks Bob', 'Jasmine'])
others

Unnamed: 0,age,salary
Rev. Daiho,55,80000
Big Bucks Bob,69,125000
Jasmine,35,95000


Now we want to scale those values, but we want to use the same scale I used for the training set. 

This is easy. When I transformed the training set I used `scaler.fit_transform`. In addition to transforming the dataframe columns, it trained the scaler. When I want to transform a dataset using an already trained scaler, I can use `scaler.transform`:

In [88]:
others[['age', 'salary']] =  scaler.transform(others[['age', 'salary']] )
others


Unnamed: 0,age,salary
Rev. Daiho,0.707317,0.0
Big Bucks Bob,1.04878,1.285714
Jasmine,0.219512,0.428571
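
Notice that Big Bucks Bob's values are greater than 1. That is expected: `transform` reuses the minimum and maximum learned from the training set, and Bob is older and earns more than anyone in it. A quick check with NumPy, hard-coding the training minimums and maximums from the `simple` table (a sketch of the arithmetic, not the scaler's actual internals):

```python
import numpy as np

# Min and max of each column (age, salary) in the training set
train_min = np.array([26, 80000])
train_max = np.array([67, 115000])

# Rev. Daiho, Big Bucks Bob, Jasmine
others_raw = np.array([[55, 80000],
                       [69, 125000],
                       [35, 95000]])

# Same formula as before, but with the training set's min and max
others_scaled = (others_raw - train_min) / (train_max - train_min)
print(others_scaled.round(6))
```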


Now I can classify those people:

In [89]:
knn.predict(others)

array([1, 0, 1])

Ok, so we predict that Big Bucks Bob did not vote for the *Treat Everyone with Compassion* Act.
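
To see where those predictions come from, here is the 1-nearest-neighbor rule worked by hand on the scaled values from the tables above. This is only a sketch that assumes Euclidean distance (which matches the `metric='minkowski', p=2` shown in the classifier output); the function name `nearest_vote` is hypothetical.

```python
import math

# Scaled training points (age, salary) and their votes
train = {'Mr. Cool': ((0.0, 0.0), 1),
         'Old Dude': ((1.0, 1.0), 0),
         'Ann': ((0.04878, 0.571429), 1)}

# Scaled people we want to classify
others_scaled = {'Rev. Daiho': (0.707317, 0.0),
                 'Big Bucks Bob': (1.04878, 1.285714),
                 'Jasmine': (0.219512, 0.428571)}

def nearest_vote(point):
    """Return (name, vote) of the closest training point."""
    name = min(train, key=lambda n: math.dist(point, train[n][0]))
    return name, train[name][1]

for person, point in others_scaled.items():
    print(person, nearest_vote(point))
```

Each new person simply inherits the vote of whichever training person is closest to them.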


## Removing rows with missing data

There are many options when dealing with rows with missing data. I am going to skip that entire discussion and just show you how to remove rows that contain missing data. First, some data:

In [77]:
data = pd.read_csv('https://raw.githubusercontent.com/zacharski/machine-learning/master/data/athletesMissingValue.csv')
data


Unnamed: 0,Name,Sport,Height,Weight
0,Asuka Teramoto,Gymnastics,54.0,66
1,Brittainey Raven,Basketball,,162
2,Chen Nan,Basketball,78.0,204
3,Gabby Douglas,Gymnastics,49.0,90
4,Helalia Johannes,Track,65.0,99
5,Irina Miketenko,Track,,106
6,Jennifer Lacy,Basketball,75.0,175
7,Kara Goucher,Track,67.0,123
8,Linlin Deng,Gymnastics,54.0,68
9,Nakia Sanford,Basketball,76.0,200


As you can see, some of the rows contain missing data. For example, we don't have a height for Brittainey Raven.  We can remove all rows that have missing values by:


In [78]:
data.dropna()

Unnamed: 0,Name,Sport,Height,Weight
0,Asuka Teramoto,Gymnastics,54.0,66
2,Chen Nan,Basketball,78.0,204
3,Gabby Douglas,Gymnastics,49.0,90
4,Helalia Johannes,Track,65.0,99
6,Jennifer Lacy,Basketball,75.0,175
7,Kara Goucher,Track,67.0,123
8,Linlin Deng,Gymnastics,54.0,68
9,Nakia Sanford,Basketball,76.0,200
10,Nikki Blue,Basketball,68.0,163
11,Qiushuang Huang,Gymnastics,61.0,95


As you can see, Brittainey vanishes!
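
Two details are easy to trip over. `dropna` returns a new dataframe and leaves `data` untouched, so assign the result if you want to keep it. And if only certain columns matter, the `subset` parameter limits the check to them. A minimal sketch on a made-up three-row frame (the values here are placeholders, not the real athlete data):

```python
import numpy as np
import pandas as pd

# Made-up stand-in data with one missing Height and one missing Weight
demo = pd.DataFrame({'Name': ['A', 'B', 'C'],
                     'Height': [54.0, np.nan, 78.0],
                     'Weight': [66.0, 162.0, np.nan]})

complete = demo.dropna()                     # drop rows with any missing value
has_height = demo.dropna(subset=['Height'])  # only require Height

print(len(demo), len(complete), len(has_height))  # 3 1 2
```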


## groupby
Remember that problem we had in the Pandas Dataframe notebook, where we had to fill in a table that looked like this:


| x | Avg. BMI | Avg. Diabetes Pedigree | Avg. times pregnant | Avg. Plasma glucose |
| --- | :---: | :---: | :---: | :---: |
| Has Diabetes | | | | |
| Doesn't have Diabetes | | | | |

Chances are good that it took you a bit of code to come up with your solution. Using `groupby` would have made your life easier (you have every right to grumble that I should have mentioned this earlier):

In [81]:
# load the diabetes data, then average every column for each value of 'diabetes'
cols = ['pregnant', 'glucose', 'bp', 'skinFold', 'insulin', 'bmi', 'pedigree', 'age', 'diabetes']
d6 = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data', names=cols)
d6.groupby('diabetes').mean()


diabetes,pregnant,glucose,bp,skinFold,insulin,bmi,pedigree,age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164
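
To get just the columns the table asks for, select them after the `groupby`. Here is a sketch on a tiny made-up frame so it runs without downloading anything; the numbers are placeholders and only the column names match `d6`.

```python
import pandas as pd

# Tiny made-up stand-in for d6
demo = pd.DataFrame({'bmi': [30.0, 32.0, 36.0, 34.0],
                     'glucose': [100, 120, 150, 130],
                     'age': [25, 35, 45, 55],
                     'diabetes': [0, 0, 1, 1]})

# Average only the columns of interest, one row per diabetes value
summary = demo.groupby('diabetes')[['bmi', 'glucose']].mean()
print(summary)
```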


There is one useful thing (among many) that I use this for. Let's say we build a classifier to determine whether or not a person has diabetes and it is 60% accurate. That is not great, but it is better than chance, so I think we should celebrate this small victory!

*{Cue ominous music}*

#### or is it better than chance?

Suppose instead of building a classifier I am just going to guess. What is the best guess to make? To answer this I will use the handy-dandy `groupby` function.



In [93]:
d6.groupby('diabetes').count()


diabetes,pregnant,glucose,bp,skinFold,insulin,bmi,pedigree,age
0,500,500,500,500,500,500,500,500
1,268,268,268,268,268,268,268,268


So if I always guess that a person does not have diabetes, I will be right 500 / 768 of the time, or about 65% accurate. So the 60% classifier we built was actually worse than just guessing. Bummer. So it is a good idea to use `groupby` just to see what the guessing baseline is.
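
If all you need is the baseline, `value_counts` with `normalize=True` gives the class proportions directly. The sketch below rebuilds the label column from the counts above (500 without diabetes, 268 with) rather than downloading the data; on the real dataframe the same call would be made on `d6['diabetes']`.

```python
import pandas as pd

# Labels reconstructed from the counts in the table above
labels = pd.Series([0] * 500 + [1] * 268, name='diabetes')

proportions = labels.value_counts(normalize=True)
baseline = proportions.max()  # accuracy of always guessing the majority class

print(proportions)
print(round(baseline, 3))  # 0.651
```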


### cheating yourself 

Let's build that diabetes classifier. First we will get some dataframes set up for the problem:

In [97]:
diabetes_features = d6[['pregnant', 'glucose', 'bp', 'skinFold', 'insulin', 'bmi', 'pedigree', 'age']]
diabetes_labels = d6['diabetes']


and for now let's not bother with scaling anything. 

##### build the knn classifier

In [98]:
from sklearn.neighbors import KNeighborsClassifier
k1nn = KNeighborsClassifier(n_neighbors=1)
k1nn.fit(diabetes_features, diabetes_labels)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

Now I am going to test how accurate the classifier is:

In [104]:
predicted = k1nn.predict(diabetes_features)

##  and now I am using a super long way to calculate accuracy
## (because the easy way is a question on a previous notebook)
actual = np.array(diabetes_labels)
total = len(actual)
correct = 0
for i in range(len(actual)):
    if actual[i] == predicted[i]:
        correct += 1
print(correct / total)




1.0


**Wow** Our algorithm is 100% accurate. Now it is time for a celebration!

But is it?

We constructed a kNN classifier with a k of 1 (so we are using a single nearest neighbor), and to compute the accuracy we used the same data we trained the classifier on. The nearest neighbor for person number 1 will be person number 1, for person number 2 it will be person number 2, and so on. And we know whether that person had diabetes or not! **We misled ourselves into thinking we had a great classifier.** The problem is most obvious with kNN with a k of 1, bt