# INTRODUCTION

In this problem set, you will use Python and pylab to write an agglomerative hierarchical clustering algorithm. You will then use your algorithm to cluster cities across the United States according to the climate information available about each.

## GETTING STARTED

Download: Problem Set 6 skeleton code.

The problem set contains two files: 

`cityTemps.txt` - a text file containing information about various cities in the United States. The first few lines, which start with a hash (#), give the column titles of the available data. The lines without a hash hold the actual data, separated by commas: for each city, its name followed by the average temperature in January, April, July, and October, the annual precipitation in inches, and the number of days of precipitation in a year.

`clusterCities.py` - a file containing some useful classes and a partial implementation of the Cluster and ClusterSet classes. In particular, this file contains:
* A function, `scaleFeatures`, that can be used to scale the dataset features. Scaling normalizes each feature to have a mean of 0 and a standard deviation of 1 (this is what the code below does), so that features with large numeric ranges do not overwhelm the others.
* Functions to read the data in `cityTemps.txt` and produce a list of objects of type City from the data in the file.
* Classes `Point` and `City`
* An incomplete implementation of class `Cluster`. You will implement the functions `singleLinkageDist`, `maxLinkageDist`, `averageLinkageDist`.
* An incomplete implementation of class `ClusterSet`. You will implement `mergeClusters`, `findClosest`, `mergeOne`.
* A function that uses classes to implement the clustering and accumulate the results.
* A function `hCluster` to test the program.
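
To make the scaling step concrete, here is a minimal standalone sketch of the same idea as `scaleFeatures` (subtract the mean, divide by the population standard deviation), written without pylab. The helper name `z_scale` and the sample values are made up for illustration and are not part of the skeleton code.

```python
def z_scale(vals):
    """Shift vals to mean 0 and divide by the population standard deviation.

    Hypothetical helper for illustration; assumes vals is a non-empty
    sequence of numbers that are not all equal (otherwise sd is 0).
    """
    n = float(len(vals))
    mean = sum(vals) / n
    centered = [v - mean for v in vals]
    sd = (sum(c ** 2 for c in centered) / n) ** 0.5
    return [c / sd for c in centered]

temps = [10.0, 20.0, 30.0]          # made-up feature values
scaled = z_scale(temps)             # roughly [-1.22, 0.0, 1.22]

# After scaling, the feature has mean 0 and standard deviation 1.
scaled_mean = sum(scaled) / len(scaled)
scaled_sd = (sum((s - scaled_mean) ** 2 for s in scaled) / len(scaled)) ** 0.5
```

Note that the scaled values are not confined to a fixed interval; what is standardized is their mean and spread, which is what makes features with different units comparable in a Euclidean distance.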

In [1]:
import pylab

def stdDev(X):
    mean = sum(X)/float(len(X))
    tot = 0.0
    for x in X:
        tot += (x - mean)**2
    return (tot/len(X))**0.5

def scaleFeatures(vals):
    """Assumes vals is a sequence of numbers"""
    result = pylab.array(vals)
    mean = sum(result)/float(len(result))
    result = result - mean
    sd = stdDev(result)
    result = result/sd
    return result

class Point(object):
    def __init__(self, name, originalAttrs):
        """originalAttrs is an array"""
        self.name = name
        self.attrs = originalAttrs
        
    def dimensionality(self):
        return len(self.attrs)
    
    def getAttrs(self):
        return self.attrs
    
    def distance(self, other):
        #Euclidean distance metric
        result = 0.0
        for i in range(self.dimensionality()):
            result += (self.attrs[i] - other.attrs[i])**2
        return result**0.5
    
    def getName(self):
        return self.name
    
    def toStr(self):
        return self.name + str(self.attrs)
    
    def __str__(self):
        return self.name

In [2]:
#City climate example
class City(Point):
    pass

def readCityData(fName, scale = False):
    """Assumes scale is a Boolean.  If True, features are scaled"""
    dataFile = open(fName, 'r')
    numFeatures = 0
    #Process lines at top of file
    for line in dataFile: #Find number of features
        if line[0:4] == '#end': #indicates end of features
            break
        numFeatures += 1
    numFeatures -= 1
    
    #Produce featureVals, cityNames
    featureVals, cityNames = [], []
    for i in range(numFeatures):
        featureVals.append([])
        
    #Continue processing lines in file, starting after comments
    for line in dataFile:
        dataLine = line.strip().split(',') #remove newline; then split
        cityNames.append(dataLine[0])
        for i in range(numFeatures):
            featureVals[i].append(float(dataLine[i+1]))
            
    #Use featureVals to build list containing the feature vectors
    #For each city scale features, if needed
    if scale:
        for i in range(numFeatures):
            featureVals[i] = scaleFeatures(featureVals[i])
    featureVectorList = []
    for city in range(len(cityNames)):
        featureVector = []
        for feature in range(numFeatures):
            featureVector.append(featureVals[feature][city])
        featureVectorList.append(featureVector)
    return cityNames, featureVectorList

def buildCityPoints(fName, scaling):
    cityNames, featureList = readCityData(fName, scaling)
    points = []
    for i in range(len(cityNames)):
        point = City(cityNames[i], pylab.array(featureList[i]))
        points.append(point)
    return points

#Use hierarchical clustering for cities
def hCluster(points, linkage, numClusters, printHistory):
    cS = ClusterSet(City)
    for p in points:
        cS.add(Cluster([p], City))
    history = []
    while cS.numClusters() > numClusters:
        merged = cS.mergeOne(linkage)
        history.append(merged)
    if printHistory:
        print ''
        for i in range(len(history)):
            names1 = []
            for p in history[i][0].members():
                names1.append(p.getName())
            names2 = []
            for p in history[i][1].members():
                names2.append(p.getName())
            print 'Step', i, 'Merged', names1, 'with', names2
            print ''
    print 'Final set of clusters:'
    print cS.toStr()
    return cS

In [18]:
# Testing loading file
points = buildCityPoints('ProblemSet6/test.txt', False)

## Problem 1 - Linkage Criteria

(10 points possible)<br>
In this problem, you will implement three different linkage criteria: `singleLinkageDist`, `maxLinkageDist`, and `averageLinkageDist`. For our purposes, distances between elements will be calculated using the `Point` class distance method, which calculates the Euclidean distance.

* The `singleLinkageDist` between two clusters is the shortest distance between an element in one cluster and an element in the other cluster. In other words, it is the distance between the two points that are closest to each other, where one point is from one cluster and the other is from the other cluster.
* The `maxLinkageDist` between two clusters is the largest distance between an element in one cluster and an element in the other cluster. In other words, it is the distance between the two points that are farthest from each other, where one point is from one cluster and the other is from the other cluster.
* The `averageLinkageDist` between two clusters is the mean of the distances between all possible pairs of elements (`p1`, `p2`), where `p1` is from one cluster and `p2` is from the other cluster.
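
As a quick sanity check of the three definitions, the sketch below computes all of them for two tiny clusters. It uses plain tuples instead of the `Point` and `Cluster` classes, and the names `euclidean` and `pairwise_distances` are hypothetical helpers, so treat it as an illustration rather than the required implementation.

```python
from itertools import product

def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def pairwise_distances(c1, c2):
    """All distances between one point from c1 and one point from c2."""
    return [euclidean(p, q) for p, q in product(c1, c2)]

# Made-up clusters: the four corners of a 4-by-3 rectangle.
cluster_a = [(0.0, 0.0), (0.0, 3.0)]
cluster_b = [(4.0, 0.0), (4.0, 3.0)]

dists = pairwise_distances(cluster_a, cluster_b)   # [4.0, 5.0, 5.0, 4.0]
single = min(dists)                   # 4.0, the closest cross-cluster pair
complete = max(dists)                 # 5.0, the farthest cross-cluster pair
average = sum(dists) / len(dists)     # 4.5, the mean over all four pairs
```

All three criteria look at the same set of cross-cluster distances; they differ only in how that set is reduced to a single number.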

Enter all code for the `Cluster` class below, including the functions in this class that were already defined for you.

In [7]:
class Cluster(object):
    """ A Cluster is defined as a set of elements, all having 
    a particular type """
    def __init__(self, points, pointType):
        """ Elements of a cluster are saved in self.points
        and the pointType is also saved """
        self.points = points
        self.pointType = pointType
        
    def singleLinkageDist(self, other):
        """ Returns the float distance between the points that 
        are closest to each other, where one point is from 
        self and the other point is from other. Uses the 
        Euclidean dist between 2 points, defined in Point."""
        dist = float("inf")
        for point in self.points:
            for otherPoint in other.points:
                currentDist = point.distance(otherPoint)
                if currentDist < dist:
                    dist = currentDist
        return dist
    
    def maxLinkageDist(self, other):
        """ Returns the float distance between the points that 
        are farthest from each other, where one point is from 
        self and the other point is from other. Uses the 
        Euclidean dist between 2 points, defined in Point."""
        dist = 0.0
        for point in self.points:
            for otherPoint in other.points:
                currentDist = point.distance(otherPoint)
                if currentDist > dist:
                    dist = currentDist
        return dist
    
    def averageLinkageDist(self, other):
        """ Returns the float average (mean) distance between all 
        pairs of points, where one point is from self and the 
        other point is from other. Uses the Euclidean dist 
        between 2 points, defined in Point."""
        dists = []
        for p1 in self.points:
            for p2 in other.points:
                dists.append(p2.distance(p1))

        if len(dists) > 0:
            return sum(dists) / float(len(dists))
        return 0.0
        
    def members(self):
        for p in self.points:
            yield p
            
    def isIn(self, name):
        """ Returns True if the element named name is in the cluster
        and False otherwise """
        for p in self.points:
            if p.getName() == name:
                return True
        return False
    
    def toStr(self):
        result = ''
        for p in self.points:
            result = result + p.toStr() + ', '
        return result[:-2]
    
    def getNames(self):
        """ For consistency, returns a sorted list of all 
        elements in the cluster """
        names = []
        for p in self.points:
            names.append(p.getName())
        return sorted(names)
    
    def __str__(self):
        names = self.getNames()
        result = ''
        for p in names:
            result = result + p + ', '
        return result[:-2]

## Problem 2 - Merging Clusters

(10 points possible)<br>
In this problem, you will finish implementing the `ClusterSet` class by writing code for the three missing functions: `mergeClusters`, `findClosest`, and `mergeOne`.

* `mergeClusters` will create a new cluster containing the union of the points in `c1` and the points in `c2`. This new cluster will be added to the cluster set, while `c1` and `c2` are removed from the cluster set. This function does not return anything.

* `findClosest` will use the `linkage` parameter to find the distance between two clusters. It will iterate over all pairs of clusters in the cluster set and return the tuple (`c1`,`c2`) of the clusters within the cluster set that are closest. Note that no matter which linkage criterion we are using, we will always return the pair of clusters that are closest to each other.

* `mergeOne` will make use of `findClosest` to determine which pair of clusters to merge. Then, it will use `mergeClusters` to perform the merging on these two closest clusters. This function returns the tuple (`c1`,`c2`) representing the clusters that were merged.
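
The way the three methods fit together can be sketched on one-dimensional data, with plain lists standing in for `Cluster` objects. The names `single_linkage` and `merge_one` below are hypothetical, and using `itertools.combinations` to visit each unordered pair exactly once is just one possible way to do the search; it is not the structure the skeleton requires.

```python
from itertools import combinations

def single_linkage(c1, c2):
    """Smallest distance between a number in c1 and a number in c2."""
    return min(abs(p - q) for p in c1 for q in c2)

def merge_one(clusters, linkage):
    """Merge the closest pair of clusters in place.

    Returns the pair that was merged, or None if fewer than two
    clusters remain. Clusters are plain lists of numbers here.
    """
    if len(clusters) < 2:
        return None
    # Find the closest pair, visiting each unordered pair once.
    c1, c2 = min(combinations(clusters, 2),
                 key=lambda pair: linkage(pair[0], pair[1]))
    # Replace the two clusters by their union.
    clusters.remove(c1)
    clusters.remove(c2)
    clusters.append(c1 + c2)
    return (c1, c2)

clusters = [[1.0], [2.0], [6.0], [7.5]]     # made-up starting clusters
history = []
while len(clusters) > 2:
    history.append(merge_one(clusters, single_linkage))
# clusters is now [[1.0, 2.0], [6.0, 7.5]]
```

The first merge joins `[1.0]` and `[2.0]` (distance 1.0), the second joins `[6.0]` and `[7.5]` (distance 1.5), and the loop stops once the requested number of clusters is reached, which mirrors what `hCluster` does with `mergeOne`.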

To test how your code clusters the city data, you may use the `hCluster` function and uncomment the line `#test()` to run the hierarchical clustering algorithm. It may take up to a minute to cluster, so be patient. Notice that the last parameter of `hCluster` is a history flag. If it is set to `True`, `hCluster` prints more detail, in particular which clusters are merged at each step. During testing, you may also want to make up a new datafile that contains fewer datapoints, fewer features, and easier numbers to work with.

**Hint**: A simpler datafile and sample output

Below is a simpler datafile (`test.txt`). As with `cityTemps.txt`, the first header line names the point column. Each header line after that, up to (but not including) `#end`, names one feature; the number of these lines is the number of feature values that follow the point name, comma-delimited, on each data line. The line `#end` marks the end of the column titles and the beginning of the datapoints. You must have an empty line at the end of the file.

```
#point_name
#feature_value1
#feature_value2
#end
a,3,1
b,6,2
c,6,5
d,6,2
e,5,5
f,1,4
g,5,8
```
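
To illustrate how this layout is interpreted, here is a small sketch that parses the same header/data format from a string. It is a simplified stand-in for `readCityData`, not a replacement: the function name `parse_points` is hypothetical, and it assumes the `#end` line contains nothing else.

```python
def parse_points(text):
    """Parse the '#header ... #end' layout from a string.

    Returns (number of features, dict mapping point name to a list
    of float feature values). Hypothetical helper for illustration.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    end = lines.index('#end')
    # Header lines before '#end', minus the one that names the point column.
    num_features = end - 1
    points = {}
    for ln in lines[end + 1:]:
        fields = ln.split(',')
        points[fields[0]] = [float(x) for x in fields[1:1 + num_features]]
    return num_features, points

sample = """#point_name
#feature_value1
#feature_value2
#end
a,3,1
b,6,2
"""
num_features, points = parse_points(sample)
# num_features is 2; points is {'a': [3.0, 1.0], 'b': [6.0, 2.0]}
```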

Appropriate test lines with this data would be:

```
points = buildCityPoints('test.txt', False)
hCluster(points, Cluster.singleLinkageDist, 3, False)
hCluster(points, Cluster.maxLinkageDist, 3, False)
hCluster(points, Cluster.averageLinkageDist, 3, False)
```            

And with a correctly implemented set of functions, one possible output is shown below. For such a small number of data points and features it is possible that your output would be slightly different, depending on which point gets chosen in case of a tie.

Final set of clusters:
```
  C0:a
  C1:b, c, d, e, g
  C2:f
```

Final set of clusters:
```
  C0:a, b, d
  C1:c, e, g
  C2:f
```

Final set of clusters:
```
  C0:a, f
  C1:b, c, d, e
  C2:g
```

Enter all code for the `ClusterSet` class below, including the functions in this class that were already defined for you. Do not paste the `Cluster` class code.

In [4]:
class ClusterSet(object):
    """ A ClusterSet is defined as a list of clusters """
    def __init__(self, pointType):
        """ Initialize an empty set, without any clusters """
        self.members = []
        
    def add(self, c):
        """ Append a cluster to the end of the cluster list
        only if it doesn't already exist. If it is already in the 
        cluster set, raise a ValueError """
        if c in self.members:
            raise ValueError
        self.members.append(c)
        
    def getClusters(self):
        return self.members[:]
    
    def mergeClusters(self, c1, c2):
        """ Assumes clusters c1 and c2 are in self
        Adds to self a cluster containing the union of c1 and c2
        and removes c1 and c2 from self """
        self.add(Cluster(c1.points + c2.points, c1.pointType))
        self.members.remove(c1)
        self.members.remove(c2)        
    
    def findClosest(self, linkage):
        """ Returns a tuple containing the two most similar 
        clusters in self
        Closest defined using the metric linkage """
        bestDist = linkage(self.members[0], self.members[1])
        group = (self.members[0], self.members[1])
        for c1 in self.members:
            for c2 in self.members:
                if c1 == c2:
                    continue
                dist = linkage(c1,c2)
                if dist < bestDist:
                    bestDist = dist
                    group = (c1, c2)
        return group    
    
    def mergeOne(self, linkage):
        """ Merges the two most similar clusters in self
        Similar defined using the metric linkage
        Returns the clusters that were merged """
        # Nothing to merge with fewer than two clusters. The two-cluster
        # case is handled by findClosest, so the merged pair is always
        # returned as a tuple (mergeClusters itself returns None).
        if len(self.members) < 2:
            return None
        c1, c2 = self.findClosest(linkage)
        self.mergeClusters(c1,c2)
        return (c1, c2)
    
    def numClusters(self):
        return len(self.members)
    
    def toStr(self):
        cNames = []
        for c in self.members:
            cNames.append(c.getNames())
        cNames.sort()
        result = ''
        for i in range(len(cNames)):
            names = ''
            for n in cNames[i]:
                names += n + ', '
            names = names[:-2]
            result += '  C' + str(i) + ':' + names + '\n'
        return result

In [25]:
# Test with a small version of the data
points = buildCityPoints('ProblemSet6/test.txt', False)
hCluster(points, Cluster.singleLinkageDist, 3, False)
hCluster(points, Cluster.maxLinkageDist, 3, False)
hCluster(points, Cluster.averageLinkageDist, 3, False)

Final set of clusters:
  C0:a
  C1:b, c, d, e, g
  C2:f

Final set of clusters:
  C0:a, b, d
  C1:c, e, g
  C2:f

Final set of clusters:
  C0:a, b, d
  C1:c, e, g
  C2:f



<__main__.ClusterSet at 0x7fbe5f32a850>

## Problem 3-1

(1/1 point)<br>
Play around with your code on the cityTemps.txt data. Try clustering into a different number of clusters, using different linkage criteria, and with or without scaling the data. Answer the following questions:

When clustering without scaling and with the total number of clusters set to 10, which cities end up in a cluster by themselves?

 * Honolulu and Fairbanks [Answer] 
 * Anchorage and Olympia  
 * LasVegas and SanFrancisco 
 * Duluth and Miami

In [26]:
# Testing with 10 clusters
points = buildCityPoints('ProblemSet6/cityTemps.txt', False)
hCluster(points, Cluster.maxLinkageDist, 10, False)
hCluster(points, Cluster.averageLinkageDist, 10, False)
hCluster(points, Cluster.singleLinkageDist, 10, False)

Final set of clusters:
  C0:Albany, Asheville, AtlantiCity, Baltimore, Boston, Bridgeport, Charlotte, Chicago, Columbus, Concord, DesMoines, Detroit, Hartford, Indianapolis, KansasCity, Knoxville, Lexington, Louisville, Milwaukee, NewYork, Newark, Philadelphia, PortlandME, Providence, Raleigh, Richmond, Springfield, StLouis, Toledo, Washington, Wilmington
  C1:Albuquerque, DodgeCity, GrandJunction, Reno, Sacramento, SanFrancisco
  C2:Anchorage, Billings, Bismarck, Boise, Casper, Cheyenne, Denver, Fargo, Helena, Madison, Minneapolis, Omaha, SaltLakeCity, SiouxFalls, Spokane
  C3:Atlanta, BatonRouge, Birmingham, Charleston, Columbia, Houston, Jackson, Jacksonville, LittleRock, Memphis, Miami, Mobile, Montgomery, Nashville, NewOrleans, Norfolk, Savannah, Tampa, VeroBeach
  C4:Austin, Dallas, OklahomaCity, SanAntonio, Tulsa, Wichita
  C5:Buffalo, Charleston, Cleveland, Olympia, Pittsburgh, PortlandOR, Seattle
  C6:Burlington, Caribou, Duluth, GrandRapids
  C7:ElPaso, LasVegas, LongBeach, L

<__main__.ClusterSet at 0x7fbe80151990>

## Problem 3-2

(1 point possible)<br>
When clustering the data into 5 clusters using single linkage criteria, which city is in a cluster by itself when using scaling but is not in a cluster by itself when not using scaling?

 * LosAngeles
 * Anchorage
 * SanFrancisco [Answer]
 * SanDiego

In [31]:
# Testing 5 clusters
points = buildCityPoints('ProblemSet6/cityTemps.txt', True)
hCluster(points, Cluster.singleLinkageDist, 5, False)
points = buildCityPoints('ProblemSet6/cityTemps.txt', False)
hCluster(points, Cluster.singleLinkageDist, 5, False)

Final set of clusters:
  C0:Albany, Albuquerque, Asheville, Atlanta, AtlantiCity, Austin, Baltimore, BatonRouge, Billings, Birmingham, Bismarck, Boise, Boston, Bridgeport, Buffalo, Burlington, Caribou, Casper, Charleston, Charleston, Charlotte, Cheyenne, Chicago, Cleveland, Columbia, Columbus, Concord, Dallas, Denver, DesMoines, Detroit, DodgeCity, Duluth, ElPaso, Fargo, GrandJunction, GrandRapids, Hartford, Helena, Houston, Indianapolis, Jackson, Jacksonville, KansasCity, Knoxville, LasVegas, Lexington, LittleRock, LongBeach, LosAngeles, Louisville, Madison, Memphis, Miami, Milwaukee, Minneapolis, Mobile, Montgomery, Nashville, NewOrleans, NewYork, Newark, Norfolk, OklahomaCity, Olympia, Omaha, Philadelphia, Phoenix, Pittsburgh, PortlandME, PortlandOR, Providence, Raleigh, Reno, Richmond, Roswell, Sacramento, SaltLakeCity, SanAntonio, SanDiego, Savannah, Seattle, SiouxFalls, Spokane, Springfield, StLouis, Tampa, Toledo, Tucson, Tulsa, VeroBeach, Washington, Wichita, Wilmington
  C1:An

<__main__.ClusterSet at 0x7fbe63c10fd0>

## Problem 3-3

(1 point possible)<br>
In this example, scaling reduces the relative importance of days of precipitation.

 * True
 * False