# Learning Data Science w/ Python 3

Reference material: [Python for Data Science](http://nbviewer.jupyter.org/github/gumption/Python_for_Data_Science/blob/master/Python_for_Data_Science_all.ipynb) by Joe McCarthy.

I'm not a Python beginner, but for the sake of convenience I shall follow Joe McCarthy's notebook closely, since it teaches Data Science along with a primer on Python. Note that I will be using Python 3, whereas McCarthy's tutorial uses Python 2.

Good luck, myself.

## Start

In [1]:
single_instance_str = 'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'
single_instance_str

'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'

Okay, I don't have to retype *all* the introductory Python code, but I will still write out every definition myself.

In [2]:
single_instance_list = single_instance_str.split(',')
print(single_instance_list)

['p', 'k', 'f', 'n', 'f', 'n', 'f', 'c', 'n', 'w', 'e', '?', 'k', 'y', 'w', 'n', 'p', 'w', 'o', 'e', 'w', 'v', 'd']


In [3]:
attribute_names = [
    'class', 
    'cap-shape', 'cap-surface', 'cap-color', 
    'bruises?', 
    'odor', 
    'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 
    'stalk-shape', 'stalk-root', 
    'stalk-surface-above-ring', 'stalk-surface-below-ring', 
    'stalk-color-above-ring', 'stalk-color-below-ring',
    'veil-type', 'veil-color', 
    'ring-number', 'ring-type', 
    'spore-print-color', 
    'population', 
    'habitat'
]
print(attribute_names)

['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']


In [4]:
def attribute_value(instance, attribute, attribute_names):
    """Returns the value of attribute in instance, based on its position in attribute_names."""
    if attribute not in attribute_names:
        return None
    else:
        i = attribute_names.index(attribute)
        return instance[i]

Another way to do what cell 66 of McCarthy's notebook does:

In [5]:
for attr_name, attr_val in zip(attribute_names, single_instance_list):
    print(attr_name, '=', attr_val)

class = p
cap-shape = k
cap-surface = f
cap-color = n
bruises? = f
odor = n
gill-attachment = f
gill-spacing = c
gill-size = n
gill-color = w
stalk-shape = e
stalk-root = ?
stalk-surface-above-ring = k
stalk-surface-below-ring = y
stalk-color-above-ring = w
stalk-color-below-ring = n
veil-type = p
veil-color = w
ring-number = o
ring-type = e
spore-print-color = w
population = v
habitat = d


### Solution to Exercise 1

In [6]:
def print_attribute_names_and_values(instance, attribute_names):
    """Prints the attribute names and values for an instance."""
    print('Values for the', len(attribute_names), 'attributes:\n')
    for attr_name, attr_val in zip(attribute_names, instance):
        print(attr_name, '=', attr_val)

print_attribute_names_and_values(single_instance_list, attribute_names)

Values for the 23 attributes:

class = p
cap-shape = k
cap-surface = f
cap-color = n
bruises? = f
odor = n
gill-attachment = f
gill-spacing = c
gill-size = n
gill-color = w
stalk-shape = e
stalk-root = ?
stalk-surface-above-ring = k
stalk-surface-below-ring = y
stalk-color-above-ring = w
stalk-color-below-ring = n
veil-type = p
veil-color = w
ring-number = o
ring-type = e
spore-print-color = w
population = v
habitat = d


**End of solution.**

### Solution to Exercise 2

In [7]:
def load_instances(filename):
    """Returns a list of instances stored in a file.
    
    filename is expected to have a series of comma-separated attribute values per line, e.g.,
        p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d
    """
    instances = []
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            instances.append(line.strip().split(','))

    return instances

data_file = 'agaricus-lepiota.data'
all_instances = load_instances(data_file)
print('Read', len(all_instances), 'instances from', data_file)
print('First instance:', all_instances[0])

Read 8124 instances from agaricus-lepiota.data
First instance: ['p', 'x', 's', 'n', 't', 'p', 'f', 'c', 'n', 'k', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u']


*Note that I have the file `agaricus-lepiota.data` in the same directory as this notebook.*

**End of solution.**

In [8]:
UNKNOWN_VALUE = '?'

clean_instances = [inst for inst in all_instances
                   if UNKNOWN_VALUE not in inst]

print(len(clean_instances), 'clean instances')

5644 clean instances


Just FYI, this is how you do shell stuff here:

In [9]:
! pwd
! echo
! cat agaricus-lepiota.attributes

/home/trk/learn/data-science-gumption-python

class: edible=e, poisonous=p
cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
cap-color: brown=n ,buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
bruises?: bruises=t, no=f
odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
gill-attachment: attached=a, descending=d, free=f, notched=n
gill-spacing: close=c, crowded=w, distant=d
gill-size: broad=b, narrow=n
gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
stalk-shape: enlarging=e, tapering=t
stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, 

### Solution to Exercise 3

In [10]:
def load_attribute_names_and_values(filename):
    """Returns a list of attribute names and values in a file.
    
    This list contains dictionaries wherein the keys are names 
    and the values are value description dictionaries.
    
    Each value description sub-dictionary will use 
    the attribute value abbreviations as its keys 
    and the attribute descriptions as the values.
    
    filename is expected to have one attribute name and set of values per line, 
    with the following format:
        name: value_description=value_abbreviation[,value_description=value_abbreviation]*
    for example
        cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
    The attribute name and values dictionary created from this line would be the following:
        {'name': 'cap-shape', 
         'values': {'c': 'conical', 
                    'b': 'bell', 
                    'f': 'flat', 
                    'k': 'knobbed', 
                    's': 'sunken', 
                    'x': 'convex'}}
    """
    attribute_names_and_values = list()
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            attribute_dict = dict()
            attr_name = line.split(':')[0]
            attribute_dict['name'] = attr_name

            val_dict = dict()
            val_string = line.split(':')[1].strip()
            val_list = [s.strip() for s in val_string.split(',')]
            val_list = [s.split('=') for s in val_list]
            for l in val_list:
                val_dict[l[1]] = l[0]

            attribute_dict['values'] = val_dict
            attribute_names_and_values.append(attribute_dict)

    return attribute_names_and_values

attribute_file = 'agaricus-lepiota.attributes'
attribute_names_and_values = load_attribute_names_and_values(attribute_file)
print('Read', len(attribute_names_and_values), 'attribute values from', attribute_file)
print('First attribute name:', attribute_names_and_values[0]['name'], 
      '; values:', attribute_names_and_values[0]['values'])

Read 23 attribute values from agaricus-lepiota.attributes
First attribute name: class ; values: {'e': 'edible', 'p': 'poisonous'}


**End of solution.**
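
To make the line format concrete, here is a minimal sketch that parses a single line on its own. `parse_attribute_line` is a hypothetical helper of mine (it is not in McCarthy's notebook), and it uses `str.partition` instead of the `split` calls above; the result should have the same shape as the dictionaries built by `load_attribute_names_and_values`.

```python
def parse_attribute_line(line):
    """Parses one 'name: description=abbreviation, ...' line into a dict.

    Hypothetical helper for illustration only.
    """
    # Everything before the first ':' is the attribute name.
    name, _, value_string = line.partition(':')
    values = {}
    for pair in value_string.split(','):
        # Each pair looks like 'bell=b'; strip() tolerates stray spaces.
        description, _, abbreviation = pair.strip().partition('=')
        values[abbreviation.strip()] = description.strip()
    return {'name': name.strip(), 'values': values}

example = parse_attribute_line(
    'cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s'
)
print(example['name'])
print(example['values']['x'])
```

Keying the sub-dictionary by abbreviation means a value such as `'x'` from a data row can be looked up directly.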

### Solution to Exercise 4

In [11]:
from collections import Counter

def attribute_value_counts(instances, attribute, attribute_names):
    """Returns a Counter for each value of attribute in instances."""
    if attribute in attribute_names:
        return Counter([
            attribute_value(instance, attribute, attribute_names)
            for instance in instances
        ])

    return None

attribute = 'cap-shape'
attribute_value_counts = attribute_value_counts(
    clean_instances, 
    attribute, 
    attribute_names
)

print('Counts for each value of', attribute, ':')
for value in attribute_value_counts:
    print(value, ':', attribute_value_counts[value])

Counts for each value of cap-shape :
x : 2840
b : 300
s : 32
f : 2432
k : 36
c : 4


**End of solution.**

`sorted(dict)` returns a sorted list of the dict's keys.
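
A quick sketch of that behaviour, using the cap-shape counts from the output above copied by hand into a plain `dict` (an assumption made only so the snippet runs on its own):

```python
# Counts copied by hand from the cap-shape output above.
counts = {'x': 2840, 'b': 300, 's': 32, 'f': 2432, 'k': 36, 'c': 4}

# Iterating over a dict yields its keys, so sorted() returns the keys in order.
sorted_keys = sorted(counts)
print(sorted_keys)  # ['b', 'c', 'f', 'k', 's', 'x']

# The values can then be looked up key by key.
for key in sorted_keys:
    print(key, ':', counts[key])
```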

In [12]:
# Redefine the function: the previous cell rebound its name to the returned Counter
def attribute_value_counts(instances, attribute, attribute_names):
    """Returns a Counter for each value of attribute in instances."""
    if attribute in attribute_names:
        return Counter([
            attribute_value(instance, attribute, attribute_names)
            for instance in instances
        ])

    return None

attribute = 'cap-shape'
attribute_value_counts_ = attribute_value_counts(
    clean_instances, 
    attribute, 
    attribute_names
)

print('Counts for each value of', attribute, ':')
for value in sorted(attribute_value_counts_):
    print(value, ':', attribute_value_counts_[value])

Counts for each value of cap-shape :
b : 300
c : 4
f : 2432
k : 36
s : 32
x : 2840


I had not realized before that sorting a `dict` by its values is not entirely straightforward. One way to do it is with the `itemgetter(i)` function from the **`operator`** module, which is how the sorting is done in McCarthy's notebook (cell 96).

Alternatively, we can pass a `lambda` function as the `key` argument of `sorted()`. Functional programming FTW! For more info on this, see [this answer](https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value#613218 "How do I sort a dictionary by value?") on Stack Overflow.

In [13]:
attribute = 'cap-shape'
value_counts = attribute_value_counts(
    clean_instances, 
    attribute, 
    attribute_names
)

print('Counts for each value of', attribute, '(sorted by count):')
for value, count in sorted(value_counts.items(), 
                           key=lambda item: item[1],  # item is a (value, count) tuple
                           reverse=True):
    print(value, ':', count)

Counts for each value of cap-shape (sorted by count):
x : 2840
f : 2432
b : 300
k : 36
s : 32
c : 4
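
For comparison, here is a small sketch of the same sort done three ways: with `itemgetter`, with a `lambda`, and with `Counter.most_common()`. The counts are copied by hand from the output above so that the snippet is self-contained; that is my assumption, not something taken from McCarthy's notebook.

```python
from collections import Counter
from operator import itemgetter

# Counts copied by hand from the cap-shape output above.
value_counts = Counter({'x': 2840, 'b': 300, 's': 32, 'f': 2432, 'k': 36, 'c': 4})

# itemgetter(1) picks the count out of each (value, count) pair.
by_itemgetter = sorted(value_counts.items(), key=itemgetter(1), reverse=True)

# A lambda doing the same job.
by_lambda = sorted(value_counts.items(), key=lambda item: item[1], reverse=True)

# Counter also has a built-in method for "most frequent first".
by_most_common = value_counts.most_common()

print(by_itemgetter == by_lambda == by_most_common)  # True
print(by_itemgetter[0])  # ('x', 2840)
```

All the counts here are distinct, so the three approaches agree; with ties the relative order of equal counts could differ between them.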


### Solution to Exercise 5

In [14]:
def print_all_attribute_value_counts(instances, attribute_names):
    """Prints statistics of attribute values."""
    for attr in attribute_names:
        print(attr, end=': ')
        
        attr_value_counts = attribute_value_counts(
            instances, attr, attribute_names
        )
        attr_value_total = sum(attr_value_counts.values())
        attr_value_counts = sorted(
            attr_value_counts.items(),
            key=lambda x: x[1],
            reverse=True
        )
        for value, count in attr_value_counts:
            print('{} = {} ({:.3f})'.format(value,
                                            count,
                                            count / attr_value_total),
                  end=', ')
        
        print()

print('\nCounts for all attributes and values:\n')
print_all_attribute_value_counts(clean_instances, attribute_names)


Counts for all attributes and values:

class: e = 3488 (0.618), p = 2156 (0.382), 
cap-shape: x = 2840 (0.503), f = 2432 (0.431), b = 300 (0.053), k = 36 (0.006), s = 32 (0.006), c = 4 (0.001), 
cap-surface: y = 2220 (0.393), f = 2160 (0.383), s = 1260 (0.223), g = 4 (0.001), 
cap-color: g = 1696 (0.300), n = 1164 (0.206), y = 1056 (0.187), w = 880 (0.156), e = 588 (0.104), b = 120 (0.021), p = 96 (0.017), c = 44 (0.008), 
bruises?: t = 3184 (0.564), f = 2460 (0.436), 
odor: n = 2776 (0.492), f = 1584 (0.281), a = 400 (0.071), l = 400 (0.071), p = 256 (0.045), c = 192 (0.034), m = 36 (0.006), 
gill-attachment: f = 5626 (0.997), a = 18 (0.003), 
gill-spacing: c = 4620 (0.819), w = 1024 (0.181), 
gill-size: b = 4940 (0.875), n = 704 (0.125), 
gill-color: p = 1384 (0.245), n = 984 (0.174), w = 966 (0.171), h = 720 (0.128), g = 656 (0.116), u = 480 (0.085), k = 408 (0.072), r = 24 (0.004), y = 22 (0.004), 
stalk-shape: t = 2880 (0.510), e = 2764 (0.490), 
stalk-root: b = 3776 (0.669), e =

**End of solution.**

Okay, now we're about to get into the good stuff: **building and using a simple decision tree classifier**.

### Solution to Exercise 6

In [16]:
import math

def entropy(instances):
    """Returns the entropy of instances."""
    attr = 'class'
    avc = Counter([  # avc: attribute value counts
        attribute_value(instance, attr, attribute_names)
        for instance in instances
    ])
    
    ent = 0
    sum_avc = sum(avc.values())
    for a in avc:
        ent -= (avc[a] / sum_avc) * math.log2(avc[a] / sum_avc)
    
    return ent

print(entropy(clean_instances))

0.9594413373534085


**End of solution.**
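
As a sanity check, the entropy can also be computed straight from the class counts, using H = -Σ p_i · log2(p_i) where p_i is the fraction of instances in class i. The sketch below is my own self-contained variant (`entropy_from_counts` is a hypothetical name), fed with the `class` counts for the clean instances printed earlier (e = 3488, p = 2156).

```python
import math
from collections import Counter

def entropy_from_counts(counts):
    """Returns the Shannon entropy (in bits) of a mapping of label -> count."""
    total = sum(counts.values())
    return -sum(
        (n / total) * math.log2(n / total)
        for n in counts.values()
        if n > 0  # a zero count contributes nothing to the sum
    )

# Class counts for the clean instances, copied from the earlier output.
class_counts = Counter({'e': 3488, 'p': 2156})
h = entropy_from_counts(class_counts)
print(h)

# A perfectly balanced two-class split has exactly one bit of entropy.
print(entropy_from_counts({'e': 1, 'p': 1}))  # 1.0
```

The first value should agree with the `0.9594...` printed by `entropy(clean_instances)` above, up to floating-point rounding.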