## Math 425 final project problem 2

You are given part of the Wisconsin Diagnostic Breast Cancer (WDBC) dataset.¹ For each patient, you are given a vector a giving features computed from digitized images of a fine
needle aspirate (FNA) of a breast mass for that patient. The features describe characteristics
of the cell nuclei present in the image. The goal is to decide whether the cells are malignant
or benign.

Here is a brief description of the way the features were computed. Ten real-valued
quantities are computed for each cell nucleus:

• radius (mean of distances from center to points on the perimeter)

• texture (standard deviation of gray-scale values)

• perimeter

• area

• smoothness (local variation in radius lengths)

• compactness (perimeter^2 / area - 1.0)

• concavity (severity of concave portions of the contour)

• concave points (number of concave portions of the contour)

• symmetry

• fractal dimension (“coastline approximation” - 1)


The mean, standard error (stderr), and "worst" value (the mean of the largest values) of each of these features were computed for each image. Thus each specimen is represented by a vector a with thirty entries. The domain D consists of thirty strings identifying these features, e.g. "radius (mean)", "radius (stderr)", "radius (worst)", "area (mean)", and so on.

¹ https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

Two files are provided containing data, `train.data` and `validate.data`. Also provided is the module `efficient_cancer_data`.
The procedure `read_training_data` in the `efficient_cancer_data` module takes a single
argument, a string giving the pathname of a file. It reads the data in the specified file and
returns a pair (A, b) where:

• A is a matrix whose rows correspond to the data for each patient in the data set. The
elements in a row correspond to the 30 features measured for a patient.
• b is a vector whose domain is the set of patients; b[r] is 1 if the specimen of patient
r is malignant and -1 if the specimen is benign.
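
As a quick illustration of the domain D described above, the thirty feature labels can be generated from the ten quantities and the three statistics. This is only a minimal sketch: the label format with a space before the parenthesis follows the problem text, the column ordering (all means, then all standard errors, then all worst values) is assumed to mirror `feature_map` in `read_training_data`, and the names `labels` and `column` are illustrative rather than part of the provided module.

```python
# Ten per-nucleus quantities and three summary statistics, as listed in the problem text.
params = ["radius", "texture", "perimeter", "area", "smoothness",
          "compactness", "concavity", "concave points", "symmetry",
          "fractal dimension"]
stats = ["mean", "stderr", "worst"]

# One label per (statistic, quantity) pair.
labels = [f"{p} ({s})" for s in stats for p in params]

# Assumed column layout: index = (statistic index) * 10 + (quantity index).
column = {label: i for i, label in enumerate(labels)}

print(len(labels))                         # 30
print(labels[0], labels[10], labels[20])   # radius (mean) radius (stderr) radius (worst)
```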

Use `read_training_data` to read the data in the file `train.data` into the variables A, b.

(a) Use the QR algorithm to find the least-squares linear model for the data.

(b) Apply the linear model from (a) to the data set `validate.data` and predict the malignancy of the tissues. You will have to define a classifier function

$$C(y)=\begin{cases}+1 & \text{if the prediction } y \text{ is non-negative,}\\ -1 & \text{otherwise.}\end{cases}$$

(c) What is the percentage of samples that are incorrectly classified? Is it greater or
smaller than the success rate on the training data?


In [1]:
import numpy as np
import numpy.linalg as la
import matplotlib.pyplot as plt
import vec
from vec import Vec
from vecutil import vec2list
from sympy import Matrix

In [2]:
# %load efficient_cancer_data.py
# Copyright 2013 Philip N. Klein


def read_training_data(fname, D=None):
    """Given a file in appropriate format, and given a set D of features,
    returns the pair (A, b) consisting of
    a P-by-D matrix A and a P-vector b,
    where P is a set of patient identification integers (IDs).

    For each patient ID p,
      - row p of A is the D-vector describing patient p's tissue sample,
      - entry p of b is +1 if patient p's tissue is malignant, and -1 if it is benign.

    The set D of features must be a subset of the features in the data (see text).
    """
    params = ["radius", "texture", "perimeter", "area", "smoothness", "compactness", "concavity", "concave points", "symmetry", "fractal dimension"]
    stats = ["(mean)", "(stderr)", "(worst)"]
    feature_labels = {y + x for x in stats for y in params}
    # Column offset of each feature within a row (after the ID and diagnosis fields).
    feature_map = {params[i] + stats[j]: j * len(params) + i for i in range(len(params)) for j in range(len(stats))}
    if D is None:
        D = feature_labels
    feature_vectors = {}
    A = []
    b = []
    # Use a context manager so the file is closed after reading.
    with open(fname) as file:
        for line in file:
            row = line.split(",")
            patient_ID = int(row[0])
            # Column 1 holds the diagnosis: 'B' (benign) maps to -1, anything else to +1.
            b.append(-1 if row[1] == 'B' else 1)
            feature_vectors[patient_ID] = Vec(D, {f: float(row[feature_map[f] + 2]) for f in D})
            A.append(vec2list(feature_vectors[patient_ID]))
    return Matrix(A), Matrix(b)

With the QR factorization $A = QR$ of the training matrix, the least-squares solution is

$$\hat{x}=R^{-1}Q^Tb$$
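
The formula above follows from the normal equations $A^TA\hat{x}=A^Tb$: substituting $A=QR$ with $Q^TQ=I$ gives $R^TR\hat{x}=R^TQ^Tb$, and when $A$ has full column rank $R$ is invertible, so $R\hat{x}=Q^Tb$. The following is a minimal sketch on synthetic data (the random matrix and vector are stand-ins, not the WDBC data). It solves the system $R\hat{x}=Q^Tb$ instead of forming $R^{-1}$ explicitly, which is generally the numerically preferable route, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4))   # synthetic stand-in for the feature matrix
b = rng.standard_normal(50)        # synthetic stand-in for the label vector

# Reduced QR factorization: Q is 50x4 with orthonormal columns, R is 4x4 upper triangular.
Q, R = np.linalg.qr(A)

# Solve R x = Q^T b rather than computing the inverse of R.
xhat = np.linalg.solve(R, Q.T @ b)

# Cross-check against NumPy's built-in least-squares solver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(xhat, x_ref))
```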

In [35]:
def classifier(x):
    """Map each prediction to +1 if it is non-negative and to -1 otherwise."""
    # Vectorized; returns a new array instead of overwriting x in place.
    return np.where(np.asarray(x) >= 0, 1.0, -1.0)

In [53]:
def compare_norms(x,y):
    """Print the Euclidean norms of x and y and their absolute difference."""
    x_norm=la.norm(x)
    y_norm=la.norm(y)
    #display difference
    print("Norm 1:",x_norm," Norm 2:",y_norm)
    print("Difference:",np.abs(x_norm-y_norm),"\n")

In [54]:
# Read the training data
At, bt = read_training_data('train.data')

# Convert the sympy matrices to float64 NumPy arrays for numerical linear algebra
At = np.array(At).astype(np.float64)
bt = np.array(bt).astype(np.float64)

# Reduced QR factorization of the training matrix
Q, R = np.linalg.qr(At)

# xhat = R^-1 Q^T b
xhat = np.matmul(la.inv(R), Q.T)
xhat = np.matmul(xhat, bt)

# Training-set check, left disabled (count_false_positives is defined in a later cell):
# bhat_t = np.matmul(At, xhat)
# print("pre classification")
# compare_norms(bt, bhat_t)
# print("post classification")
# bhat_t = classifier(bhat_t)
# compare_norms(bt, bhat_t)
# print("false positives:", count_false_positives(bt, bhat_t))

In [55]:
def count_false_positives(b,bhat):
    """Count (and print) rows predicted malignant (+1) whose true label is benign (-1)."""
    count=0
    for i in range(np.shape(b)[0]):
        if bhat[i][0]==1 and b[i][0]==-1:
            print("row:",i,"bhat:",bhat[i][0],"b:",b[i][0])
            count+=1
    return count
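
Part (c) asks for the percentage of incorrectly classified samples, which also requires the false negatives (predicted benign, actually malignant). The sketch below shows one way this could be computed; `error_counts` is a hypothetical helper that is not part of the notebook or the provided module, and the toy labels are made up for illustration only.

```python
import numpy as np

def error_counts(b, bhat):
    """Return (false_positives, false_negatives, error_rate_percent) for +/-1 labels."""
    b = np.asarray(b).ravel()
    bhat = np.asarray(bhat).ravel()
    false_pos = int(np.sum((bhat == 1) & (b == -1)))   # predicted malignant, actually benign
    false_neg = int(np.sum((bhat == -1) & (b == 1)))   # predicted benign, actually malignant
    rate = 100.0 * (false_pos + false_neg) / b.size
    return false_pos, false_neg, rate

# Toy column vectors in the same (n, 1) shape the notebook uses; not taken from the data set.
b_true = np.array([[1.0], [-1.0], [-1.0], [1.0], [-1.0]])
b_pred = np.array([[1.0], [1.0], [-1.0], [-1.0], [-1.0]])
print(error_counts(b_true, b_pred))   # (1, 1, 40.0)
```

Applied to the notebook's variables, the call would presumably be `error_counts(bv, bhat)` after classification.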

In [56]:
# Load the validation data (same file format as the training data)
Av,bv=read_training_data('validate.data')
Av=np.array(Av).astype(np.float64)
bv=np.array(bv).astype(np.float64)

# Compute the predictions on the validation set
bhat=np.matmul(Av,xhat)

# Compare norms before and after classification
print("pre classification")
compare_norms(bv,bhat)
print("post classification")
bhat=classifier(bhat)
compare_norms(bv,bhat)

print("false positives:",count_false_positives(bv,bhat))



pre classification
Norm 1: 16.1245154965971  Norm2: 13.299651891118701
Difference: 2.824863605478397 

post classification
Norm 1: 16.1245154965971  Norm2: 16.1245154965971
Difference: 0.0 

row: 47 bhat: 1.0 b: -1.0
row: 113 bhat: 1.0 b: -1.0
row: 155 bhat: 1.0 b: -1.0
row: 165 bhat: 1.0 b: -1.0
row: 171 bhat: 1.0 b: -1.0
row: 241 bhat: 1.0 b: -1.0
false positives: 6
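
A note on reading the output above: after classification every entry of both vectors is +1 or -1, so each norm equals the square root of the number of samples regardless of how many predictions are correct. The reported value 16.1245154965971 equals $\sqrt{260}$, which is consistent with 260 validation samples (an inference from the printed norm, not a figure stated in the source). A difference of 0.0 therefore does not measure accuracy. The sketch below illustrates this with two unrelated random label vectors; the vectors are synthetic and only the length 260 is borrowed from that inference.

```python
import numpy as np

n = 260  # inferred from the printed norm: sqrt(260) is about 16.1245
rng = np.random.default_rng(1)

# Two independent random +/-1 label vectors.
u = rng.choice([-1.0, 1.0], size=n)
v = rng.choice([-1.0, 1.0], size=n)

# The norms agree even though the vectors themselves are unrelated.
print(np.linalg.norm(u), np.linalg.norm(v))

# The fraction of differing entries is what actually measures disagreement.
print(100.0 * np.mean(u != v))
```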
