# Introduction to DNA Translation


A - Adenine 
 
C - Cytosine 
 
G - Guanine 
 
T - Thymine


The central dogma of molecular biology describes the basic flow of genetic information:

DNA -> RNA -> Protein


This case study involves the following primary steps:
1. Download a DNA sequence
2. Translate the DNA sequence into amino acids
3. Download the amino acid sequence to check our solution


__Example:__

> __Input__ DNA sequence: ATACAATGGCAA <br>
> __Output__ amino acids: I Q W Q

where each codon (group of three nucleotides) maps to one amino acid: <br>
ATA -> I <br>
CAA -> Q <br>
TGG -> W <br>
CAA -> Q <br>

- Task 1: Manually download the DNA and protein sequence data.
- Task 2: Import the DNA data into Python.
- Task 3: Create an algorithm to translate the DNA.
- Task 4: Check whether the translation matches your download.

## Downloading DNA Data

This section covers how to __download DNA and protein__ sequence data from __NCBI__.

We will download two files from __NCBI__:
1. A strand of DNA
2. The corresponding protein sequence


*Note:* Go to the NCBI site, select the Nucleotide database, then search for the accession number __NM_207618.2__. [Download](https://www.ncbi.nlm.nih.gov/nuccore/NM_207618.2?report=fasta)
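
A FASTA download starts with a header line beginning with `>`, followed by the sequence wrapped over several lines. As a minimal sketch, assuming a single-record file, the header can be skipped and the sequence lines joined as follows. The helper name `read_fasta_sequence` and the header text in the example are hypothetical.

```python
def read_fasta_sequence(lines):
    """Join the sequence lines of a single-record FASTA file, skipping the '>' header."""
    parts = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the header line
        if not line or line.startswith(">"):
            continue
        parts.append(line.upper())
    return "".join(parts)


# Illustrative input: a made-up header plus the first 20 bases of the sequence
example_lines = [">example_record hypothetical header\n", "GGTCAGAAAA\n", "AGCCCTCTCC\n", "\n"]
fasta_seq = read_fasta_sequence(example_lines)
print(fasta_seq)
```

Because an open file object yields lines when iterated, the same function should also accept `open(path)` directly, where `path` is wherever the download was saved.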

## Importing DNA Data Into Python

In [2]:
inputfile = "python_case_study/translation/dna.txt"
f = open(inputfile, "r")
seq = f.read()

In [4]:
print(seq)

GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCA
GATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCT
CCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCT
TAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCT
CAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTG
AGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAA
ACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAA
GGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGAT
TTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCA
GTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGA
CCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTT
TATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATT
GCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGG
TCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTT
GCTAAT

## Translating the DNA Sequence

- Translate the DNA sequence using a dictionary lookup
- Check the length of the sequence using the modulo operator


1. Check that the length of the sequence is divisible by 3
2. Look up each 3-letter string in the table and store the result
3. Continue the lookups until reaching the end of the sequence

In [5]:
def translate(seq):
    """Translate a string containing a nucleotide sequence into a string containing the corresponding sequence of amino acids. 
    Nucleotides are translated in triplets using the table dictionary each amino acid is encoded with a string of length 1."""
    
    table = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

    protein = ""
    # Check that length of sequence is divisible by 3
    if len(seq) % 3 == 0:

        # Loop over the sequence 
        for i in range(0,len(seq),3):

            # extract a single codon
            codon = seq[i : i+3]

            # look up the codon and store result
            protein += table[codon]
    return protein

In [6]:
seq[0:3]

'GGT'

In [7]:
list(range(0,11,3))

[0, 3, 6, 9]

In [8]:
translate("ATA")

'I'

In [9]:
translate("AGA")

'R'

In [10]:
help(translate)

Help on function translate in module __main__:

translate(seq)
    Translate a string containing a nucleotide sequence into a string containing the corresponding sequence of amino acids. 
    Nucleotides are translated in triplets using the table dictionary each amino acid is encoded with a string of length 1.
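
Note that `translate` returns an empty string when the length is not divisible by 3, and raises a `KeyError` for any codon missing from the table (for example lowercase input). The following is a sketch of a stricter variant under those observations, not a replacement for the function above. The names `translate_strict` and `CODON_TABLE` are hypothetical, and the table is generated from the standard genetic code in T, C, A, G order instead of being typed out.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
# "_" marks a stop codon, matching the table used in translate().
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY__CC_W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_strict(seq):
    """Translate seq codon by codon, raising ValueError on malformed input."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length %d is not divisible by 3" % len(seq))
    protein = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        if codon not in CODON_TABLE:
            raise ValueError("unknown codon %r at position %d" % (codon, i))
        protein.append(CODON_TABLE[codon])
    return "".join(protein)


print(translate_strict("ATACAATGGCAA"))
```

Raising an error makes a bad slice visible immediately, whereas an empty result can be mistaken for a valid one.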



## Comparing Your Translation

- Use the `with` statement to read in an entire file
- Compare your translation to the protein sequence you downloaded

In [11]:
inputfile = "python_case_study/translation/dna.txt"
with open(inputfile, "r") as f:
    seq = f.read()

In [12]:
def read_seq(inputfile):
    """ Reads and returns the input sequence with special characters removed"""
    with open(inputfile, "r") as f:
        seq = f.read()
    seq = seq.replace("\n", "")
    seq = seq.replace("\r", "")
    return seq    

In [13]:
prt = read_seq("python_case_study/translation/protein.txt")

In [14]:
dna = read_seq(inputfile)

In [17]:
translate(dna[20:938])

'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC_'

In [18]:
prt

'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC'

In [19]:
translate(dna[20:935])

'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC'

In [20]:
prt == translate(dna[20:935])

True

In [21]:
prt == translate(dna[20:938])[:-1]

True

In [22]:
translate(dna[20:938])[:-1] == translate(dna[20:935])

True
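
The two slices used above differ only in the final three nucleotides: `dna[20:938]` includes the stop codon, which translates to `_`, while `dna[20:935]` leaves it out. The slice boundaries themselves are taken as given here. As a rough sketch, a candidate region can also be located by scanning from the first `ATG` to the first in-frame stop codon. This assumes the first `ATG` is the true start, which does not hold for every sequence, and the name `find_coding_region` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_coding_region(seq):
    """Return (start, end) so that seq[start:end] runs from the first ATG
    through the first in-frame stop codon, or None if there is no such region."""
    start = seq.find("ATG")
    if start == -1:
        return None
    # Step through complete codons in the reading frame set by the start codon
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return (start, i + 3)
    return None


region = find_coding_region("CCATGAAATAGGG")
print(region)
```

Dropping the last three positions of the returned range excludes the stop codon, mirroring the `[:-1]` used on the translated string above.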

# Introduction to Language Processing

This case study uses books from __Project Gutenberg__, the oldest digital library of books, to examine:

- Book lengths
- Number of unique words
- How these attributes cluster by language or author

## Counting Words

- Learn how to write your own function to count the number of times a unique word appears in a given string text
- Learn about how to use the Counter tool from the collections module to accomplish the same task

In [36]:
text = "This is my text and keep it short and simple."

In [37]:
def count_words(text):
    """
    Count the number of times each word occurs in text. Return a dictionary
    where keys are unique words and values are word counts. Skip punctuation.
    """
    text = text.lower()
    skips = [".", ",", ";", ":", "'",'"']
    for char in skips:
        text = text.replace(char, "")
        
    word_count = {}
    for word in text.split(" "):
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    return word_count        

In [38]:
count_words(text)

{'this': 1,
 'is': 1,
 'my': 1,
 'text': 1,
 'and': 2,
 'keep': 1,
 'it': 1,
 'short': 1,
 'simple': 1}

In [39]:
from collections import Counter

def count_words_fast(text):
    """
    Count the number of times each word occurs in text. Return a Counter
    where keys are unique words and values are word counts. Skip punctuation.
    """
    text = text.lower()
    skips = [".", ",", ";", ":", "'",'"']
    for char in skips:
        text = text.replace(char, "")
        
    word_count = Counter(text.split(" "))
    return word_count        

In [40]:
count_words_fast(text)

Counter({'this': 1,
         'is': 1,
         'my': 1,
         'text': 1,
         'and': 2,
         'keep': 1,
         'it': 1,
         'short': 1,
         'simple': 1})

In [41]:
count_words(text) == count_words_fast(text)

True

In [42]:
len(count_words("This comprehension check is to check for comprehension."))

6

In [43]:
count_words(text) is count_words_fast(text)

False
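
The last two cells give different answers because `==` compares contents while `is` compares object identity. `Counter` is a subclass of `dict`, so it compares equal to a plain dictionary holding the same counts, but the two function calls still return separate objects. A small sketch with made-up counts:

```python
from collections import Counter

manual = {"and": 2, "keep": 1}             # plain dictionary, as built by count_words
fast = Counter(["and", "keep", "and"])     # Counter, as built by count_words_fast

same_contents = (manual == fast)   # compares keys and values
same_object = (manual is fast)     # compares identity
is_dict = isinstance(fast, dict)   # Counter inherits from dict
top = fast.most_common(1)          # most frequent word with its count

print(same_contents, same_object, is_dict, top)
```

`most_common` is one convenience a `Counter` offers beyond a plain dictionary.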

## Reading in a Book
- Read in a book from file

In [44]:
def read_book(title_path):
    """
    Read a book and return it as a string.
    """
    with open(title_path, "r", encoding="UTF-8") as current_file:
        text = current_file.read()
        text = text.replace("\n", "").replace("\r", "")
    return text

In [45]:
text = read_book("resource/books/English/shakespeare/Romeo and Juliet.txt")
index = text.find("What's in a name?")
sample_text = text[index : index + 1000]

In [46]:
index

42757

In [47]:
sample_text

"What's in a name? That which we call a rose    By any other name would smell as sweet.    So Romeo would, were he not Romeo call'd,    Retain that dear perfection which he owes    Without that title. Romeo, doff thy name;    And for that name, which is no part of thee,    Take all myself.  Rom. I take thee at thy word.    Call me but love, and I'll be new baptiz'd;    Henceforth I never will be Romeo.  Jul. What man art thou that, thus bescreen'd in night,    So stumblest on my counsel?  Rom. By a name    I know not how to tell thee who I am.    My name, dear saint, is hateful to myself,    Because it is an enemy to thee.    Had I it written, I would tear the word.  Jul. My ears have yet not drunk a hundred words    Of that tongue's utterance, yet I know the sound.    Art thou not Romeo, and a Montague?  Rom. Neither, fair saint, if either thee dislike.  Jul. How cam'st thou hither, tell me, and wherefore?    The orchard walls are high and hard to climb,    And the place death, consid

## Computing Word Frequency Statistics

- Learn how to compute some basic word frequency statistics
- Use the `word_stat` function to compare different translations of the same book

In [48]:
def word_stat(word_counts):
    """Return number of unique words and word frequencies."""
    num_unique = len(word_counts)
    counts = word_counts.values()
    return (num_unique, counts)

In [49]:
text = read_book("resource/books/English/shakespeare/Romeo and Juliet.txt")
word_counts = count_words(text)
(num_unique, counts) = word_stat(word_counts)

In [50]:
print(num_unique, sum(counts))

5118 40776


In [51]:
text = read_book("resource/books/German/shakespeare/Romeo und Julia.txt")
word_counts = count_words(text)
(num_unique, counts) = word_stat(word_counts)

In [52]:
print(num_unique, sum(counts))

7527 20311


## Reading Multiple Files

- Learn how to navigate file directories and read in multiple files/books at once
- Get a brief introduction to pandas, which provides additional data structure and data analysis functionalities for Python

In [54]:
import os
book_dir = "./resource/books/"

import pandas as pd
stats = pd.DataFrame(columns=("language", "author", "title", "length", "unique"))
title_num = 1
for language in os.listdir(book_dir):
    for author in os.listdir(book_dir + "/" + language):
        for title in os.listdir(book_dir + "/" + language + "/" + author):
            inputfile = book_dir + "/" + language + "/" + author + "/" + title
            print(inputfile)
            text = read_book(inputfile)
            (num_unique, counts) = word_stat(count_words(text))
            stats.loc[title_num] = language, author.capitalize(), title.replace(".txt", ""), sum(counts), num_unique
            title_num +=1

./resource/books//English/shakespeare/A Midsummer Night's Dream.txt
./resource/books//English/shakespeare/Hamlet.txt
./resource/books//English/shakespeare/Macbeth.txt
./resource/books//English/shakespeare/Othello.txt
./resource/books//English/shakespeare/Richard III.txt
./resource/books//English/shakespeare/Romeo and Juliet.txt
./resource/books//English/shakespeare/The Merchant of Venice.txt
./resource/books//French/chevalier/L'åle de sable.txt
./resource/books//French/chevalier/L'enfer et le paradis de l'autre monde.txt
./resource/books//French/chevalier/La capitaine.txt
./resource/books//French/chevalier/La fille des indiens rouges.txt
./resource/books//French/chevalier/La fille du pirate.txt
./resource/books//French/chevalier/Le chasseur noir.txt
./resource/books//French/chevalier/Les derniers Iroquois.txt
./resource/books//French/de Maupassant/Boule de Suif.txt
./resource/books//French/de Maupassant/Claire de Lune.txt
./resource/books//French/de Maupassant/Contes de la Becasse.tx

In [55]:
type(os.listdir)

builtin_function_or_method
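
The paths printed above contain a double slash because `book_dir` already ends in `/` and another `/` is added during concatenation. As a sketch of one way to avoid that, `os.path.join` inserts separators only where needed. The helper name `list_books` is hypothetical, and the demo builds a throwaway directory instead of assuming the book folder exists.

```python
import os
import tempfile


def list_books(book_dir):
    """Return sorted (language, author, title) tuples for every .txt file
    found in a language/author/title.txt directory layout."""
    books = []
    for language in sorted(os.listdir(book_dir)):
        language_path = os.path.join(book_dir, language)
        if not os.path.isdir(language_path):
            continue  # ignore stray files at the top level
        for author in sorted(os.listdir(language_path)):
            author_path = os.path.join(language_path, author)
            if not os.path.isdir(author_path):
                continue
            for title in sorted(os.listdir(author_path)):
                if title.endswith(".txt"):
                    books.append((language, author, title[:-len(".txt")]))
    return books


# Demo on a temporary directory with a single book file
with tempfile.TemporaryDirectory() as root:
    author_dir = os.path.join(root, "English", "shakespeare")
    os.makedirs(author_dir)
    with open(os.path.join(author_dir, "Hamlet.txt"), "w", encoding="utf-8") as f:
        f.write("To be, or not to be")
    demo_books = list_books(root)

print(demo_books)
```

Sorting the directory listings keeps the order stable, since `os.listdir` returns entries in arbitrary order.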

In [58]:
stats

| | language | author | title | length | unique |
|---|---|---|---|---|---|
| 1 | English | Shakespeare | A Midsummer Night's Dream | 16103 | 4345 |
| 2 | English | Shakespeare | Hamlet | 28551 | 6776 |
| 3 | English | Shakespeare | Macbeth | 16874 | 4780 |
| 4 | English | Shakespeare | Othello | 26590 | 5898 |
| 5 | English | Shakespeare | Richard III | 48315 | 5449 |
| 6 | English | Shakespeare | Romeo and Juliet | 40776 | 5118 |
| 7 | English | Shakespeare | The Merchant of Venice | 20949 | 4978 |
| 8 | French | Chevalier | L'åle de sable | 73801 | 18989 |
| 9 | French | Chevalier | L'enfer et le paradis de l'autre monde | 40827 | 10831 |
| 10 | French | Chevalier | La capitaine | 46306 | 13083 |


## Plotting Book Statistics

- Let's use matplotlib.pyplot to plot the book length and unique word statistics

In [59]:
import matplotlib.pyplot as plt
%matplotlib notebook
plt.plot(stats.length, stats.unique, "bo")

plt.loglog(stats.length, stats.unique, "bo")

stats[stats.language == "English"]

plt.figure(figsize = (10,10))
subset = stats[stats.language == "English"]
plt.loglog(subset.length, subset.unique, "o", label="English", color="crimson")

subset = stats[stats.language == "French"]
plt.loglog(subset.length, subset.unique, "o", label="French", color="forestgreen")

subset = stats[stats.language == "German"]
plt.loglog(subset.length, subset.unique, "o", label="German", color="orange")

subset = stats[stats.language == "Portuguese"]
plt.loglog(subset.length, subset.unique, "o", label="Portuguese", color="blueviolet")
plt.legend()
plt.xlabel("Book Length")
plt.ylabel("Number of unique words")
plt.savefig("lang_plot.pdf")


# Introduction to kNN Classification

## Finding the Distance Between Two Points

- Use Python to determine the Euclidean distance between two points expressed as NumPy arrays

In [60]:
import numpy as np
import random

In [61]:
def distance(p1, p2):
    """Return the Euclidean distance between points p1 and p2."""
    return np.sqrt(np.sum(np.power(p2 - p1, 2)))

p1 = np.array([1,1])
p2 = np.array([4,4])
distance (p1, p2)

4.242640687119285
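
The Euclidean distance is the square root of the sum of squared coordinate differences. As a cross-check that needs no NumPy, here is a sketch using only the standard library. The name `distance_plain` is hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def distance_plain(p1, p2):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p1, p2)))


d = distance_plain((1, 1), (4, 4))   # same points as the NumPy example
d_builtin = math.dist((1, 1), (4, 4))
print(d, d_builtin)
```

Both values should agree with the NumPy result above up to floating-point rounding.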

## Majority Vote

- Find the most common vote in an array or sequence of votes
- Compare two different methods for finding the most common vote


__A Note on Terminology__ <br>
Note that while this method is commonly called "majority vote," what is actually determined is the plurality vote, because the most common vote does not need to represent a majority of votes. We have used the standard naming convention of majority vote here.

In [65]:
def majority_vote(votes):
    """Return the most common element in votes, breaking ties at random."""
    vote_counts = {}
    for vote in votes:
        if vote in vote_counts:
            vote_counts[vote] += 1
        else:
            vote_counts[vote] = 1
            
    winners = []
    max_count = max(vote_counts.values())
    for vote, count in vote_counts.items():
        if count == max_count:
            winners.append(vote)
            
    return random.choice(winners) 

votes = [1,2,3,1,2,3,3,3,3,1]
winner = majority_vote(votes)

In [67]:
votes = [1,2,3,1,2,3,3,3,3,1]
winner = majority_vote(votes)

In [68]:
votes = [1,2,3,1,2,3,3,3,3,1,2,2,2,2]
winner = majority_vote(votes)

In [69]:
import scipy.stats as ss

In [70]:
def majority_vote_short(votes):
    """Return the most common element in votes."""
    mode, count = ss.mstats.mode(votes)
    return mode

In [71]:
votes = [1,2,3,1,2,3,3,3,3,1,2,2,2,2]
winner = majority_vote_short(votes)
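
To make the terminology note concrete, the sketch below uses `Counter` to return every value tied for first place together with its share of the votes. The name `plurality_winners` is hypothetical. In the 14-vote example, the value 2 wins with 6 votes, which is less than half of all votes, so it is a plurality and not a majority.

```python
from collections import Counter


def plurality_winners(votes):
    """Return (winners, share): all values tied for the highest count, sorted,
    and the fraction of all votes that each of them received."""
    counts = Counter(votes)
    max_count = max(counts.values())
    winners = sorted(v for v, c in counts.items() if c == max_count)
    return winners, max_count / len(votes)


winners, share = plurality_winners([1, 2, 3, 1, 2, 3, 3, 3, 3, 1, 2, 2, 2, 2])
print(winners, share)
```

Returning all tied winners leaves the tie-breaking rule, such as the `random.choice` used in `majority_vote`, to the caller.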

## Finding Nearest Neighbors
 - Learn how to find the nearest neighbors of an observation
 - Use the nearest neighbors to predict the class of an observation

In [73]:
# loop over all points 
# compute the distance between point p and every other point 
# sort distances and return those K points that are nearest to point p

points = np.array([[1,1], [1,2], [1,3], [2,1], [2,2], [2,3], [3,1], [3,2], [3,3]])

In [74]:
p = np.array([2.5,2])

In [76]:
plt.plot(points[:, 0], points[:, 1], 'ro')
plt.plot(p[0], p[1], 'bo')
plt.axis([0.5, 3.5, 0.5, 3.5])

(0.5, 3.5, 0.5, 3.5)

In [77]:
def find_nearest_neighbors(p, points, k=5):
    """Find the K nearest neighbors of p and return their indices."""
    distances = np.zeros(points.shape[0])

    for i in range(len(distances)):
        distances[i] = distance(p, points[i])
    index = np.argsort(distances)
    return index[:k]

In [79]:
points[4]

array([2, 2])

In [81]:
points[7]

array([3, 2])

In [84]:
# index= np.argsort(distances)
# distances[index]
# distances[index[0:2]]

In [85]:
ind = find_nearest_neighbors(p, points, 2)

In [86]:
print(points[ind])

[[2 2]
 [3 2]]


In [87]:
ind = find_nearest_neighbors(p, points, 3)

In [88]:
ind = find_nearest_neighbors(p, points, 4)
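
The same idea (compute every distance, sort, keep the first k indices) can be sketched without NumPy. The name `nearest_neighbors_plain` is hypothetical, and `math.dist` requires Python 3.8 or later. Python's `sorted` is stable, so points at equal distance keep their original order.

```python
import math


def nearest_neighbors_plain(p, points, k=5):
    """Return the indices of the k points closest to p, nearest first."""
    order = sorted(range(len(points)), key=lambda i: math.dist(p, points[i]))
    return order[:k]


# Same 3-by-3 grid of points as above, written as tuples
grid_points = [(x, y) for x in (1, 2, 3) for y in (1, 2, 3)]
nearest = nearest_neighbors_plain((2.5, 2), grid_points, k=2)
print(nearest, [grid_points[i] for i in nearest])
```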

In [89]:
def knn_predict(p, points, outcomes, k=5):
    """Predict the class of p by majority vote among its k nearest neighbors."""
    # Find the k nearest neighbors
    index = find_nearest_neighbors(p, points, k)
    # Predict the class of p based on the majority vote
    return majority_vote(outcomes[index])

In [90]:
outcomes = np.array([0,0,0,0,1,1,1,1,1])
len(outcomes)

9

In [91]:
knn_predict(np.array([2.5, 2.7]), points, outcomes, k=2)

1

In [92]:
knn_predict(np.array([1.0, 2.7]), points, outcomes, k=2)

0

## Generating Synthetic Data

- Learn how to __generate synthetic data__

In [93]:
ss.norm(0,1).rvs((5,2))

array([[-0.178384  ,  0.31579546],
       [-1.42652098,  0.72675855],
       [-1.22004402,  1.23635734],
       [-0.66874769, -1.03164964],
       [-1.3092162 ,  0.06046696]])

In [94]:
ss.norm(1,1).rvs((5,2))

array([[ 0.71325714, -0.61577579],
       [ 1.46257362,  1.14911081],
       [ 1.94069974,  1.97227475],
       [ 3.70221528,  1.42982781],
       [ 0.90557276, -0.84832795]])

In [95]:
def generate_synth_data(n=50):
    """Create two sets of points from bivariate normal distributions"""
    points = np.concatenate((ss.norm(0,1).rvs((n,2)), ss.norm(1,1).rvs((n,2))), axis=0)
    outcomes = np.concatenate((np.repeat(0, n), np.repeat(1, n)))
    return (points, outcomes)

In [96]:
n = 20
points, outcomes = generate_synth_data(n)

In [97]:
plt.figure()
plt.plot(points[:n, 0], points[:n, 1], "ro")
plt.plot(points[n:, 0], points[n:, 1], "bo")
plt.savefig('bivardata.pdf')
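
`generate_synth_data` draws new random numbers on every call, so the points and plots change from run to run. As a sketch of a reproducible variant, the version below uses only the standard library and a fixed seed. The name `generate_synth_data_seeded` and the `seed` parameter are hypothetical, and it returns plain lists in place of NumPy arrays.

```python
import random


def generate_synth_data_seeded(n=50, seed=0):
    """Create two classes of 2-D points: n points with independent
    Normal(0, 1) coordinates and n points with Normal(1, 1) coordinates."""
    rng = random.Random(seed)   # private generator, so the global state is untouched
    points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    points += [(rng.gauss(1, 1), rng.gauss(1, 1)) for _ in range(n)]
    outcomes = [0] * n + [1] * n
    return points, outcomes


seeded_points, seeded_outcomes = generate_synth_data_seeded(n=20, seed=42)
print(len(seeded_points), seeded_outcomes.count(0), seeded_outcomes.count(1))
```

Calling the function again with the same seed returns identical points, which makes figures easier to compare across runs.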


## Making a Prediction Grid
- Learn how to make a __prediction grid__
- Learn how to use enumerate
- Learn how to use NumPy meshgrid

In [111]:
def make_prediction_grid(predictors, outcomes, limits, h, k):
    """
    Classify each point on the prediction grid.
    """
    (x_min, x_max, y_min, y_max) = limits
    xs = np.arange(x_min, x_max, h)
    ys = np.arange(y_min, y_max, h)
    xx, yy =  np.meshgrid(xs, ys)
    
    prediction_grid = np.zeros(xx.shape, dtype=int)
    for i,x in enumerate(xs):
        for j,y in enumerate(ys):
            p = np.array([x,y])
            # meshgrid arrays have shape (len(ys), len(xs)): rows follow y, columns follow x
            prediction_grid[j,i] = knn_predict(p, predictors, outcomes, k)
            
    return (xx, yy, prediction_grid)
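
With its default settings, `np.meshgrid(xs, ys)` returns arrays of shape `(len(ys), len(xs))`, so rows correspond to y values and columns to x values. A plain-Python sketch of that layout, using a deliberately non-square grid, shows which index goes where. The name `make_grid_plain` and the toy classifier are hypothetical.

```python
def make_grid_plain(xs, ys, classify):
    """Return a list of rows; row j holds the results for y = ys[j],
    and column i within a row holds the result for x = xs[i]."""
    grid = [[0] * len(xs) for _ in ys]
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            grid[j][i] = classify(x, y)   # row index is j (y), column index is i (x)
    return grid


xs = [0.0, 1.0, 2.0]   # three x values give three columns
ys = [10.0, 20.0]      # two y values give two rows
demo_grid = make_grid_plain(xs, ys, lambda x, y: x + y)
print(demo_grid)
```

On a square grid, swapping the two indices raises no error and silently transposes the result, which is why the non-square example is useful for checking.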

In [112]:
seasons = ["spring", "summer", "fall", "winter"]

In [113]:
list(enumerate(seasons))

[(0, 'spring'), (1, 'summer'), (2, 'fall'), (3, 'winter')]

In [114]:
for ind, season in enumerate(seasons):
    print(ind, season)

0 spring
1 summer
2 fall
3 winter


## Plotting the Prediction Grid

- Learn how to __plot the prediction grid__
- Learn about the __bias-variance tradeoff__

In [122]:
def plot_prediction_grid (xx, yy, prediction_grid, filename):
    """ Plot KNN predictions for every point on the grid."""
    from matplotlib.colors import ListedColormap
    background_colormap = ListedColormap (["hotpink","lightskyblue", "yellowgreen"])
    observation_colormap = ListedColormap (["red","blue","green"])
    plt.figure(figsize =(10,10))
    plt.pcolormesh(xx, yy, prediction_grid, cmap = background_colormap, alpha = 0.5)
    plt.scatter(predictors[:,0], predictors [:,1], c = outcomes, cmap = observation_colormap, s = 50)
    plt.xlabel('Variable 1'); plt.ylabel('Variable 2')
    plt.xticks(()); plt.yticks(())
    plt.xlim (np.min(xx), np.max(xx))
    plt.ylim (np.min(yy), np.max(yy))
    plt.savefig(filename)

In [123]:
(predictors, outcomes) = generate_synth_data()
k=5
filename="knn_synth_5.pdf" 
limits=(-3, 4,-3, 4) 
h = 0.1
(xx, yy, prediction_grid) = make_prediction_grid(predictors, outcomes, limits, h, k)
plot_prediction_grid(xx, yy, prediction_grid, filename) 


In [124]:
(predictors, outcomes) = generate_synth_data()
k=50
filename="knn_synth_50.pdf" 
limits=(-3, 4,-3, 4) 
h = 0.1
(xx, yy, prediction_grid) = make_prediction_grid(predictors, outcomes, limits, h, k)
plot_prediction_grid(xx, yy, prediction_grid, filename) 


## Applying the kNN Method

- Learn how to __apply the homemade kNN classifier__ to a real dataset
- __Compare__ the performance of the homemade kNN classifier to the performance of the kNN classifier from the scikit-learn module

In [125]:
from sklearn import datasets
iris = datasets.load_iris()

In [126]:
iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [127]:
predictors = iris.data[:, 0:2]
outcomes = iris.target

plt.plot(predictors[outcomes==0][:,0], predictors[outcomes==0][:,1], "ro")
plt.plot(predictors[outcomes==1][:,0], predictors[outcomes==1][:,1], "go")
plt.plot(predictors[outcomes==2][:,0], predictors[outcomes==2][:,1], "bo")
plt.savefig("iris.pdf")

In [131]:
# k=5
# filename="iris_grid.pdf" 
# limits=(4, 8, 1.5, 4.5) 
# h = 0.1
# (xx, yy, prediction_grid) = make_prediction_grid(predictors, outcomes, limits, h, k)
# plot_prediction_grid(xx, yy, prediction_grid, filename) 

In [135]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(predictors, outcomes)
sk_predictions = knn.predict(predictors)

In [137]:
sk_predictions.shape

(150,)

In [138]:
my_predictions = np.array([knn_predict(p, predictors, outcomes, 5) for p in predictors])

In [139]:
my_predictions.shape

(150,)

In [140]:
print(100* np.mean(sk_predictions == my_predictions))

96.0
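
The 96.0 above is the percentage of observations on which the two classifiers agree with each other, not the accuracy of either one. The same calculation against the true labels gives accuracy on the data the classifier was fitted to. A dependency-free sketch, with the hypothetical name `percent_agreement` and made-up labels:

```python
def percent_agreement(a, b):
    """Percentage of positions at which two equal-length label sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if len(a) == 0:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100 * matches / len(a)


agreement = percent_agreement([0, 1, 2, 2], [0, 1, 1, 2])   # made-up labels
print(agreement)
```

Under these assumptions, `percent_agreement(my_predictions, outcomes)` and `percent_agreement(sk_predictions, outcomes)` would report how often each classifier reproduces the iris labels. Since both were evaluated on their own training data, those figures are likely optimistic.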
