# Classification

The need to group data arises naturally in many scenarios. For example, a marketing officer may need to identify which group of customers is the most worthwhile to target, and a researcher might want to categorize households based on their sociodemographic background. 

Techniques used in grouping data are separated into two broad categories:
- **Classification** - Group data based on examples you provide.
- **Clustering** - Group data without any examples. 

The former is a form of *supervised learning*, while the latter is *unsupervised learning*. Which type of technique to use depends on your needs. In the case of marketing, perhaps you already know from experience which type of customer is the most profitable, so you would go for classification. On the other hand, when it comes to categorizing households, you might want the program to automatically separate the households into groups, which would be clustering.


## A. Classification
Suppose you have the following data:

| Customer |   Address   | Spending |
|:--------:|:-----------:|:--------:|
|     1    |   Central   |   High   |
|     2    |  Admiralty  |   High   |
|     3    | North Point |    Low   |
|     4    |    Shatin   |   High   |
|     5    |    Fo Tan   |    Low   |
|     6    |  Ma On Shan |    Low   |

And you need to predict the spending of the following customer:

| Customer |   Address   | Spending |
|:--------:|:-----------:|:--------:|
|     7    |   Chai Wan  |     ?    |

How should you do so? 

Before we proceed further, let us first go over the terminology commonly used in classification. In economics we usually call each data point an *observation*, with "Spending" being a *dependent variable* and "Address" an *independent variable*. In classification it is more common to call each data point a *sample*, with "Spending" being a *class variable* and "Address" a *feature*.

In [None]:
#Data
import pandas as pd
raw_data = [
            [1,1,'Central',22.2819,114.1581,1],
            [2,1,'Admiralty',22.2796,114.1655,1],
            [3,0,'North Point',22.2871,114.1917,1],
            [4,1,'Shatin',22.3771,114.1974,0],
            [5,0,'Fo Tan',22.3969,114.1959,0],
            [6,0,'Ma On Shan',22.4221,114.2324,0],
            ]
labels = ['customer','hi_spending','address','latitude','longitude','hk_island']
data = pd.DataFrame.from_records(raw_data,columns=labels)

In [None]:
#Check data
data

### Ordinary Least Squares
There are multiple ways to approach this problem. As an economics major, the first technique that comes to mind is probably ordinary least squares (OLS). Why is OLS not suitable for classification?
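
One way to explore this question is to fit OLS to a binary class variable and look at the fitted values. The sketch below uses hypothetical toy data (the arrays `x` and `y` are made up for illustration and are not the customer data) and shows that the fitted values, which we would like to read as probabilities, are not confined to the interval between 0 and 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one feature and a binary class label
x = np.arange(6).reshape(-1, 1)      # feature values 0, 1, ..., 5
y = np.array([0, 0, 0, 1, 1, 1])     # class labels

ols = LinearRegression().fit(x, y)

# Nothing constrains the fitted values to lie between 0 and 1
pred_low = ols.predict([[0]])[0]
pred_high = ols.predict([[10]])[0]
print("Fitted value at x=0: ", pred_low)
print("Fitted value at x=10:", pred_high)
```

A fitted value below 0 or above 1 cannot be interpreted as a probability, which is one reason to prefer a model such as logit for this task.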

### Logit

A seasoned economist will likely use a logit regression to handle this task. First let us consider the district of each customer's address:

| Customer |   Address   | Hong Kong Island | High Spending |
|:--------:|:-----------:|:----------------:|:-------------:|
|     1    |   Central   |         1        |       1       |
|     2    |  Admiralty  |         1        |       1       |
|     3    | North Point |         1        |       0       |
|     4    |    Shatin   |         0        |       1       |
|     5    |    Fo Tan   |         0        |       0       |
|     6    |  Ma On Shan |         0        |       0       |
|     7    |   Chai Wan  |         1        |       ?       |

In [None]:
#Logistic regression
from sklearn.linear_model import LogisticRegression

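#Get data
X = data[['hk_island']].values
y = data['hi_spending'].values

#Train model
logit = LogisticRegression()
logit.fit(X,y)

#Predict (Chai Wan is on Hong Kong Island, so hk_island = 1)
print("Prediction (high spending = 1):".ljust(35),logit.predict([[1]]))
print("Pr(Low) Pr(High):".ljust(35),logit.predict_proba([[1]]))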

One question we might have from the above is why the estimated probabilities are not exactly 1/3 and 2/3, the sample frequencies among Hong Kong Island customers. This is because in data mining, constraints are added to penalize extreme estimates. This technique is called *regularization*. Regularization is done to prevent overfitting, the phenomenon of closely fitting existing data but producing poor predictions for unseen samples.

If we weaken the regularization in the logistic regression (in scikit-learn, by raising the parameter `C`, which is the inverse of the regularization strength), we will get predictions closer to (1/3, 2/3):

In [None]:
#Logistic regression with weak regularization (high C)
logit = LogisticRegression(C=1e5)
logit.fit(data[['hk_island']].values,data['hi_spending'].values)

#Predict
print("Prediction (high spending = 1):".ljust(35),logit.predict([[1]]))
print("Pr(Low) Pr(High):".ljust(35),logit.predict_proba([[1]]))

The address in the above example has in effect already been categorized. What if we have latitude and longitude data instead?

| Customer |   Address   | Latitude | Longitude | High Spending |
|:--------:|:-----------:|:--------:|:---------:|:-------------:|
|     1    |   Central   |  22.2819 |  114.1581 |       1       |
|     2    |  Admiralty  |  22.2796 |  114.1655 |       1       |
|     3    | North Point |  22.2871 |  114.1917 |       0       |
|     4    |    Shatin   |  22.3771 |  114.1974 |       1       |
|     5    |    Fo Tan   |  22.3969 |  114.1959 |       0       |
|     6    |  Ma On Shan |  22.4221 |  114.2324 |       0       |
|     7    |   Chai Wan  |   22.27  |   114.24  |       ?       |

In [None]:
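#Training data
X2 = data[['latitude','longitude']].values
y = data['hi_spending'].values

#Test data
X_ChaiWan = [[22.27,114.24]]
X_Shatin = [[22.3771,114.1974]]
X_NorthPoint = [[22.2871,114.1917]]
X_Fotan = [[22.3969,114.1959]]

#Train Model
model = LogisticRegression()
model.fit(X2,y)
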
#Predict
tw = 50
print("Model Accuracy:".ljust(tw),model.score(X2,y))
print("Chai Wan Prediction (high spending = 1):".ljust(tw),model.predict(X_ChaiWan))
print("Chai Wan Est. Prob - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_ChaiWan))
print("Shatin Prediction (high spending = 1):".ljust(tw),model.predict(X_Shatin))
print("Shatin Est. Prob. - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_Shatin))
print("North Point Prediction (high spending = 1):".ljust(tw),model.predict(X_NorthPoint))
print("North Point Est. Prob. - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_NorthPoint))
print("")
print("Fo Tan Est. Prob. - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_Fotan))


### Bayes Rule
Bayes' rule says that

$$
P(y \mid \vec{x}) = \frac{P(\vec{x} \mid y)P(y)}{P(\vec{x})}
$$
Applying this to our current problem,

$$
P(spending \mid location) = \frac{P(location \mid spending)P(spending)}{P(location)}
$$
Our task is to pick a value for $spending$ that maximizes this probability:

$$
\hat{spending} = \underset{spending}{\operatorname{argmax}}  \left \{ \frac{P(location \mid spending)P(spending)}{P(location)} \right \}
$$

Notice that $P(location)$ is constant for any given location, so we can eliminate it and get

$$
\hat{spending} = \underset{spending}{\operatorname{argmax}}  \left \{ P(location \mid spending)P(spending) \right \}
$$
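
When the location is a single categorical variable, such as the Hong Kong Island indicator from the logit section, both terms in the expression above can be estimated by counting. The following is a minimal sketch of that calculation for Chai Wan; the variable names are chosen for illustration only.

```python
from fractions import Fraction

# (hk_island, hi_spending) for customers 1-6, taken from the logit table
samples = [(1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0)]
location = 1   # Chai Wan is on Hong Kong Island

scores = {}
for spending in (0, 1):
    in_class = [loc for loc, s in samples if s == spending]
    prior = Fraction(len(in_class), len(samples))                    # P(spending)
    likelihood = Fraction(in_class.count(location), len(in_class))   # P(location | spending)
    scores[spending] = likelihood * prior

# Pick the class with the largest P(location | spending) P(spending)
prediction = max(scores, key=scores.get)

# Dividing by P(location), the sum over classes, recovers the posterior
posterior = scores[prediction] / sum(scores.values())
print("Prediction (high spending = 1):", prediction)
print("Posterior probability of predicted class:", posterior)
```

Counting works here because there are only two possible locations. With continuous features such as latitude and longitude, a distributional assumption is needed instead.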

To solve this maximization problem we need $P(location \mid spending)$, and there are two common ways to get that.

#### i. Naive Bayes
Naive Bayes assumes that all the elements of $\vec{x}$ are independent of one another given the class $y$, so

$$
P(\vec{x} \mid y) = P(x_1 \mid y) \cdot P(x_2 \mid y) \cdot P(x_3 \mid y) \cdots
$$

In Gaussian naive Bayes, each $P(x_i \mid y)$ is assumed to be normally distributed. The means and standard deviations are estimated from the training data.
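
The calculation behind a Gaussian naive Bayes prediction can be sketched by hand: for each class, estimate a mean and variance for every feature, multiply the class prior by the per-feature normal densities, and normalize. The code below is an illustrative sketch using the latitude and longitude data from the table above; the helper `normal_pdf` and the variable names are assumptions made for this example. Its numbers may differ slightly from scikit-learn's `GaussianNB`, which adds a small smoothing term to the variances.

```python
import numpy as np

# Training data from the table above: (latitude, longitude) and high-spending label
X = np.array([[22.2819, 114.1581], [22.2796, 114.1655], [22.2871, 114.1917],
              [22.3771, 114.1974], [22.3969, 114.1959], [22.4221, 114.2324]])
y = np.array([1, 1, 0, 1, 0, 0])
x_new = np.array([22.27, 114.24])   # Chai Wan

def normal_pdf(x, mean, var):
    """Density of a normal distribution, evaluated element-wise."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

classes = np.unique(y)
scores = []
for c in classes:
    Xc = X[y == c]
    prior = len(Xc) / len(X)                          # P(y)
    mean, var = Xc.mean(axis=0), Xc.var(axis=0)       # per-feature estimates
    likelihood = normal_pdf(x_new, mean, var).prod()  # product over features
    scores.append(prior * likelihood)

scores = np.array(scores)
posterior = scores / scores.sum()      # normalize so the probabilities sum to one
prediction = classes[np.argmax(scores)]
print("Pr(Low) Pr(High):", posterior)
print("Prediction (high spending = 1):", prediction)
```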

In [None]:
#Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X2,y)

print("Model Accuracy:".ljust(tw),model.score(X2,y))
print("Chai Wan Prediction (high spending = 1):".ljust(tw),model.predict(X_ChaiWan))
print("Chai Wan Est. Prob - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_ChaiWan))
print("Shatin Prediction (high spending = 1):".ljust(tw),model.predict(X_Shatin))
print("Shatin Est. Prob. - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_Shatin))
print("North Point Prediction (high spending = 1):".ljust(tw),model.predict(X_NorthPoint))
print("North Point Est. Prob. - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_NorthPoint))

#### ii. Linear Discriminant Analysis (LDA)
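LDA allows the features to be correlated: it models $P(\vec{x} \mid y)$ as a multivariate normal distribution whose covariance matrix is shared by all classes.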
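
To make the role of the covariance matrix concrete, here is a hedged sketch of the textbook LDA calculation done by hand: estimate class means and a pooled within-class covariance, then compare linear discriminant scores. The variable names are illustrative, and the data are centered first purely for numerical convenience. This is a sketch of the standard formula, not a description of how scikit-learn computes it internally.

```python
import numpy as np

# Training data from the table above: (latitude, longitude) and high-spending label
X = np.array([[22.2819, 114.1581], [22.2796, 114.1655], [22.2871, 114.1917],
              [22.3771, 114.1974], [22.3969, 114.1959], [22.4221, 114.2324]])
y = np.array([1, 1, 0, 1, 0, 0])
x_new = np.array([22.27, 114.24])   # Chai Wan

# Center on the overall mean (shifting the data does not change the classification)
center = X.mean(axis=0)
Xc, xc = X - center, x_new - center

classes = np.unique(y)
n, k = len(Xc), len(classes)

# Class means, class priors, and the pooled within-class covariance matrix
means = np.array([Xc[y == c].mean(axis=0) for c in classes])
priors = np.array([(y == c).mean() for c in classes])
pooled = sum((Xc[y == c] - m).T @ (Xc[y == c] - m)
             for c, m in zip(classes, means)) / (n - k)

# Linear discriminant score of x_new for each class
inv_mu = np.linalg.solve(pooled, means.T)   # column j is inverse(pooled) @ means[j]
scores = xc @ inv_mu - 0.5 * np.sum(means.T * inv_mu, axis=0) + np.log(priors)

# Convert scores to posterior probabilities
posterior = np.exp(scores - scores.max())
posterior /= posterior.sum()
prediction = classes[np.argmax(scores)]
print("Pr(Low) Pr(High):", posterior)
print("Prediction (high spending = 1):", prediction)
```

The off-diagonal entry of `pooled` is what lets the latitude and longitude be correlated; setting it to zero would bring the calculation closer to the naive Bayes assumption.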

In [None]:
#Linear Discriminant Analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model = LinearDiscriminantAnalysis()
model.fit(X2,y)

print("Model Accuracy:".ljust(tw),model.score(X2,y))
print("Chai Wan Prediction (high spending = 1):".ljust(tw),model.predict(X_ChaiWan))
print("Chai Wan Est. Prob - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_ChaiWan))
print("Shatin Prediction (high spending = 1):".ljust(tw),model.predict(X_Shatin))
print("Shatin Est. Prob. - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_Shatin))
print("North Point Prediction (high spending = 1):".ljust(tw),model.predict(X_NorthPoint))
print("North Point Est. Prob. - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_NorthPoint))

### Support Vector Machine (SVM)

An SVM looks for a boundary that separates two classes while allowing for a buffer zone where mistakes are tolerated.
<img src="http://scikit-learn.org/stable/_images/sphx_glr_plot_svm_margin_001.png">
Source: <a href="http://scikit-learn.org/stable/auto_examples/svm/plot_svm_margin.html">scikit learn</a>
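
The boundary and buffer zone can be inspected directly on a small example. The sketch below uses hypothetical toy data (`X_toy` and `y_toy` are made up for illustration) and a linear kernel, and prints the support vectors, which are the samples that determine the boundary.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated groups
X_toy = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Linear kernel; a smaller C tolerates more mistakes inside the buffer zone
svm = SVC(kernel="linear", C=1.0)
svm.fit(X_toy, y_toy)

toy_pred = svm.predict(X_toy)
print("Support vectors:")
print(svm.support_vectors_)
print("Decision function values:", svm.decision_function(X_toy))
print("Predictions:", toy_pred)
```

The sign of the decision function gives the predicted side of the boundary, and only the support vectors influence where that boundary lies.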

In [None]:
#Support Vector Machine
from sklearn.svm import SVC
model = SVC(probability=True)
model.fit(X2,y)

print("Model Accuracy:".ljust(tw),model.score(X2,y))
print("Chai Wan Prediction (high spending = 1):".ljust(tw),model.predict(X_ChaiWan))
print("Chai Wan Est. Prob - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_ChaiWan))
print("Shatin Prediction (high spending = 1):".ljust(tw),model.predict(X_Shatin))
print("Shatin Est. Prob. - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_Shatin))
print("North Point Prediction (high spending = 1):".ljust(tw),model.predict(X_NorthPoint))
print("North Point Est. Prob. - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_NorthPoint))

### Nearest Neighbor
Another method we could use is to look at samples that have similar characteristics to the one we are trying to predict. This method is called *nearest neighbor*.

In the simplest case, we will use the closest sample as a predictor:

In [None]:
#Nearest Neighbor
from sklearn.neighbors import KNeighborsClassifier

#Only consider the nearest neighbor
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X2,y)

print("Model Accuracy:".ljust(tw),model.score(X2,y))
print("Chai Wan Prediction (high spending = 1):".ljust(tw),model.predict(X_ChaiWan))
print("Chai Wan Est. Prob - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_ChaiWan))
print("Shatin Prediction (high spending = 1):".ljust(tw),model.predict(X_Shatin))
print("Shatin Est. Prob. - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_Shatin))
print("North Point Prediction (high spending = 1):".ljust(tw),model.predict(X_NorthPoint))
print("North Point Est. Prob. - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_NorthPoint))

Unlike previous methods, when we try to predict the outcome of a pre-existing sample such as Shatin, we will get the correct answer. Naturally, this is because a pre-existing sample's closest neighbor is itself.

Note that the number of neighbors is crucial. For example, suppose we use the three closest neighbors instead:

In [None]:
#Consider the three closest neighbors
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X2,y)
print("Model Accuracy:".ljust(tw),model.score(X2,y))
print("Chai Wan Prediction (high spending = 1):".ljust(tw),model.predict(X_ChaiWan))
print("Chai Wan Est. Prob - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_ChaiWan))
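
To see why the number of neighbors matters for Chai Wan, the distances can be computed by hand. The following sketch treats latitude and longitude in raw degrees as plain Euclidean coordinates, which is an assumption of this example, and takes a majority vote among the closest samples. The helper `knn_predict` is written for illustration only.

```python
import numpy as np

names = np.array(["Central", "Admiralty", "North Point",
                  "Shatin", "Fo Tan", "Ma On Shan"])
X = np.array([[22.2819, 114.1581], [22.2796, 114.1655], [22.2871, 114.1917],
              [22.3771, 114.1974], [22.3969, 114.1959], [22.4221, 114.2324]])
y = np.array([1, 1, 0, 1, 0, 0])     # high spending = 1
x_new = np.array([22.27, 114.24])    # Chai Wan

# Euclidean distance from Chai Wan to every training sample
dist = np.sqrt(((X - x_new) ** 2).sum(axis=1))
order = np.argsort(dist)             # indices sorted from closest to farthest

def knn_predict(k):
    """Majority vote among the k closest training samples (k odd)."""
    votes = y[order[:k]]
    return int(votes.mean() > 0.5)

for k in (1, 3):
    print(k, "nearest:", list(names[order[:k]]), "-> prediction", knn_predict(k))
```

The single closest sample is a low-spending customer, but two of the three closest are high-spending, so the vote flips when more neighbors are considered.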

### Classification Tree
A classification tree repeatedly looks for the cutoff that gives the best prediction at each stage.
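
The search for a single cutoff can be sketched by hand. The code below tries every midpoint between adjacent feature values and scores each candidate by the weighted Gini impurity of the two resulting groups, which is one common criterion (assumed here for illustration). The helper names `gini` and `candidate_splits` are made up for this example.

```python
import numpy as np

features = ["latitude", "longitude"]
X = np.array([[22.2819, 114.1581], [22.2796, 114.1655], [22.2871, 114.1917],
              [22.3771, 114.1974], [22.3969, 114.1959], [22.4221, 114.2324]])
y = np.array([1, 1, 0, 1, 0, 0])     # high spending = 1

def gini(labels):
    """Gini impurity of a set of binary labels."""
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return 2 * p * (1 - p)

def candidate_splits(X, y):
    """Weighted Gini impurity for every midpoint cutoff on every feature."""
    results = []
    for j, name in enumerate(features):
        values = np.unique(X[:, j])                  # sorted unique values
        for cutoff in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] <= cutoff], y[X[:, j] > cutoff]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            results.append((name, round(float(cutoff), 4), impurity))
    return results

splits = candidate_splits(X, y)
best = min(s[2] for s in splits)
for name, cutoff, impurity in sorted(splits, key=lambda s: s[2]):
    print(name.ljust(10), cutoff, round(impurity, 3))
```

Several candidates can share the lowest impurity, in which case an implementation has to pick one of them. A full tree then repeats the same search within each resulting group.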

In [None]:
#Classification Tree
from sklearn import tree
model = tree.DecisionTreeClassifier()
model.fit(X2,y)

print("Model Accuracy:".ljust(tw),model.score(X2,y))
print("Chai Wan Prediction (high spending = 1):".ljust(tw),model.predict(X_ChaiWan))
print("Chai Wan Est. Prob - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_ChaiWan))
print("Shatin Prediction (high spending = 1):".ljust(tw),model.predict(X_Shatin))
print("Shatin Est. Prob. - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_Shatin))
print("Fo Tan Est. Prob. - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_Fotan))
print("North Point Prediction (high spending = 1):".ljust(tw),model.predict(X_NorthPoint))
print("North Point Est. Prob. - Pr(Low) Pr(High):".ljust(tw),model.predict_proba(X_NorthPoint))

#Export tree structure to PNG format
from io import StringIO
import pydot
dotfile = StringIO()
tree.export_graphviz(model, out_file=dotfile)
pydot.graph_from_dot_data(dotfile.getvalue())[0].write_png("tree.png")

Here is the tree structure as shown in tree.png:

<img src="http://www.ticoneva.com/econ/econ4130/images/8-tree.png" width="300">

We can plot the cutoffs on a map to see how the tree works. This will take a couple of seconds to generate. Skip to the attached image below if you do not want to wait.

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

lat = data['latitude'].values
lon = data['longitude'].values

#Map size
padding = 0.15 # padding
lat_min = min(lat) - padding
lat_max = max(lat) + padding
lon_min = min(lon) - padding
lon_max = max(lon) + padding

#Create map using Basemap
plt.figure(figsize=(8,8))
m = Basemap(llcrnrlon=lon_min,
            llcrnrlat=lat_min,
            urcrnrlon=lon_max,
            urcrnrlat=lat_max,
            lat_0=(lat_max + lat_min)/2,
            lon_0=(lon_max + lon_min)/2,
            resolution = 'h',
            )
m.drawcoastlines()
m.drawmapboundary(fill_color='#46bcec')
m.fillcontinents(color = 'white',lake_color='#46bcec')

# plot points and cutoffs
m.scatter(lon, lat, marker = 'o', color='r', zorder=5)
m.plot([114.179,114.179],[lat_min,lat_max], 'k-', zorder=5)
m.plot([114.197,114.197],[lat_min,lat_max], 'b-', zorder=5)
m.plot([lon_min,lon_max],[22.4,22.4], 'g-', zorder=5)
plt.show()

The generated image should look the same as this one. The black, blue and green lines are the first, second and third cutoffs respectively.

<img src="http://www.ticoneva.com/econ/econ4130/images/8-map-tree.png" width="300">


## B. Clustering

Clustering algorithms group data without supervision. To do so, they minimize some measure of distance between data within the same group. For example, this could be simple distance as measured by the difference between values, or it could be variation as measured by variance.

| Customer |   Address   | Latitude | Longitude |
|:--------:|:-----------:|:--------:|:---------:|
|     1    |   Central   |  22.2819 |  114.1581 |
|     2    |  Admiralty  |  22.2796 |  114.1655 |
|     3    | North Point |  22.2871 |  114.1917 |
|     4    |    Shatin   |  22.3771 |  114.1974 |
|     5    |    Fo Tan   |  22.3969 |  114.1959 |
|     6    |  Ma On Shan |  22.4221 |  114.2324 |
|     7    |   Chai Wan  |   22.27  |   114.24  |
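
As an illustration of a within-group distance measure, the sketch below computes the sum of squared distances from each point to its own group mean for two candidate groupings of the seven locations. The helper `within_ss` and both groupings are assumptions made for this example; a clustering algorithm such as K-means searches for a grouping that makes this quantity small.

```python
import numpy as np

# Latitude and longitude of the seven locations in the table above
X_all = np.array([[22.2819, 114.1581], [22.2796, 114.1655], [22.2871, 114.1917],
                  [22.3771, 114.1974], [22.3969, 114.1959], [22.4221, 114.2324],
                  [22.27, 114.24]])

def within_ss(X, labels):
    """Sum of squared distances from each point to its own group mean."""
    labels = np.asarray(labels)
    return sum(((X[labels == g] - X[labels == g].mean(axis=0)) ** 2).sum()
               for g in np.unique(labels))

by_region = [0, 0, 0, 1, 1, 1, 0]     # Hong Kong Island locations vs. the rest
alternating = [0, 1, 0, 1, 0, 1, 0]   # an arbitrary grouping for comparison

print("Within-group sum of squares, by region:  ", within_ss(X_all, by_region))
print("Within-group sum of squares, alternating:", within_ss(X_all, alternating))
```

The geographically sensible grouping gives the smaller value, which is the sense in which it is the "better" clustering under this measure.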

In [None]:
# data
raw_data2 = [
            [1,'Central',22.2819,114.1581],
            [2,'Admiralty',22.2796,114.1655],
            [3,'North Point',22.2871,114.1917],
            [4,'Shatin',22.3771,114.1974],
            [5,'Fo Tan',22.3969,114.1959],
            [6,'Ma On Shan',22.4221,114.2324],
            [7,'Chai Wan',22.27,114.24]
            ]
labels = ['customer','address','latitude','longitude']
data2 = pd.DataFrame.from_records(raw_data2,columns=labels)

In [None]:
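X3 = data2[["latitude","longitude"]]

#K-Means
from sklearn.cluster import KMeans, AgglomerativeClustering

#Two clusters (random_state is fixed only to make the run repeatable)
km = KMeans(n_clusters=2,n_init=10,random_state=0)
print(km.fit_predict(X3))

#Three clusters
km = KMeans(n_clusters=3,n_init=10,random_state=0)
print(km.fit_predict(X3))

#Four clusters
km = KMeans(n_clusters=4,n_init=10,random_state=0)
print(km.fit_predict(X3))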

In [None]:
#Agglomerative Clustering
ac = AgglomerativeClustering(n_clusters=2)
y_ac = ac.fit_predict(X3)
print(y_ac)

ac = AgglomerativeClustering(n_clusters=3)
y_ac = ac.fit_predict(X3)
print(y_ac)

ac = AgglomerativeClustering(n_clusters=4)
y_ac = ac.fit_predict(X3)
print(y_ac)

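Notice that the labelling is arbitrary: the numbers only identify which samples are grouped together, so the same grouping can be reported with its labels swapped.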
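
Because the label numbers carry no meaning, two clustering results should be compared by the groupings they imply rather than label by label. One way to do this, sketched below with made-up label vectors, is scikit-learn's `adjusted_rand_score`, which is unaffected by how the groups are numbered.

```python
from sklearn.metrics import adjusted_rand_score

# Two hypothetical results that describe the same grouping with swapped labels
labels_a = [0, 0, 0, 1, 1, 1, 0]
labels_b = [1, 1, 1, 0, 0, 0, 1]

# Element-wise comparison suggests the results disagree everywhere
matches = sum(a == b for a, b in zip(labels_a, labels_b))
print("Positions with identical labels:", matches)

# The adjusted Rand index compares groupings, not label values
score = adjusted_rand_score(labels_a, labels_b)
print("Adjusted Rand index:", score)
```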