# PCA/LDA Assignment
## CS156 // Professor Sterne
## Soren Gran // 10/22/18

In [112]:
import os
from os import listdir
from PIL import Image, ImageFile
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
ImageFile.LOAD_TRUNCATED_IMAGES = True

In [111]:
# first we need to build a way to make our images all the same size arrays so we can build a model for them
# What kind of data cleaning do we want to do? We want to make sure the datasets do not share any images.
# It would also be helpful to choose images that have the same backgrounds so our classifier can focus on what
# is likely most important: the center of the image.

# Originally I was going to do a men-women classifier but the images are really terrible for classification
# compared to the jersey-shirt images.

def img_to_array(img):
    img = img.resize((138, 138))  # resize so every image has the same dimensions
    img = list(img.getdata())  # flatten the pixels into a list of 138**2 = 19044 (R, G, B) tuples,
                               # each holding 3 values between 0 and 255 (256 levels per color channel)
    img = list(map(list, img))  # convert each tuple to a list so numpy builds a 2-D array sklearn can take
    img = np.array(img)  # the array has shape (19044, 3): one row per pixel, one column per channel
    s = img.shape[0] * img.shape[1]  # total number of values in the array: 19044 * 3 = 57132
    img_wide = img.reshape(1, s)  # put all the values into a single row
    return img_wide[0]  # return that row as a 1-D feature vector

jerseydir = listdir('/Users/sorengran/Downloads/Jersey')
shirtdir = listdir('/Users/sorengran/Downloads/Shirt')

visited = []

jerseys = []
for image in jerseydir:
    img = Image.open('/Users/sorengran/Downloads/Jersey/' + image).convert('RGB')
    if img.getpixel((0,0)) == (255,255,255): # first we check to make sure the top left corner pixel is white
                                # We hope this means the background is white (this will not always be true)
        new_image = img_to_array(img)
        jerseys.append(new_image)
        visited.append(hash(str(new_image[25000:26000])))

doubles = []
shirts = []
for image in shirtdir:
    img = Image.open('/Users/sorengran/Downloads/Shirt/' + image).convert('RGB')
    if img.getpixel((0,0)) == (255,255,255):
        new_image = img_to_array(img)
        if hash(str(new_image[25000:26000])) not in visited:
            shirts.append(new_image)
        else:
            doubles.append(new_image)
# print(len(doubles)) # I did this to make sure my hash method for finding doubles wasn't catching too many images
# Its length was 16 so I am pretty sure it worked fine
# We have 355 jersey images and 297 shirt images. That is pretty good.
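
The duplicate check above fingerprints each image by hashing the string form of a 1000-value slice. A minimal sketch of an alternative is shown here, assuming it is acceptable to hash the full pixel content with `hashlib`; the helper names `pixel_fingerprint` and `split_duplicates` are hypothetical, and the toy arrays stand in for the real flattened images. Like the original check, it only catches exact pixel matches.

```python
import hashlib
import numpy as np

def pixel_fingerprint(arr):
    """Return a hex digest of the array's raw bytes (hypothetical helper)."""
    # Pixel values are 0-255, so casting to uint8 loses nothing
    arr = np.ascontiguousarray(arr, dtype=np.uint8)
    return hashlib.sha1(arr.tobytes()).hexdigest()

def split_duplicates(candidates, seen_fingerprints):
    """Split candidates into (unique, duplicates) relative to seen_fingerprints."""
    unique, duplicates = [], []
    for arr in candidates:
        if pixel_fingerprint(arr) in seen_fingerprints:
            duplicates.append(arr)
        else:
            unique.append(arr)
    return unique, duplicates

# Toy stand-ins for flattened images (not the real Jersey/Shirt data)
jersey_like = [np.zeros(12, dtype=np.uint8), np.full(12, 255, dtype=np.uint8)]
shirt_like = [np.full(12, 255, dtype=np.uint8), np.arange(12, dtype=np.uint8)]

seen = {pixel_fingerprint(a) for a in jersey_like}
unique, duplicates = split_duplicates(shirt_like, seen)
print(len(unique), len(duplicates))
```

Hashing the whole array avoids depending on one slice of the image, at the cost of hashing more bytes per image.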

In [138]:
# Label the data: '1' for jerseys, '0' for shirts
raw_data = [(row,'1') for row in jerseys] + [(row,'0') for row in shirts]

# Break into X, y
data = np.array([x for (x,y) in raw_data])
labels = np.array([y for (x,y) in raw_data])

# Scale the data - scaling made little difference to the results, so the split below uses the unscaled data
scaler = StandardScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.30, random_state=42)
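
The scaler above is fit on every row before the split. A common alternative is to compute the scaling statistics from the training rows only and reuse them for the test rows. The following is a rough numpy-only sketch of that idea, not the notebook's actual procedure; `fit_standardizer` and `apply_standardizer` are hypothetical names and the data is random. Constant columns (for example, pixels that are always white) are left unscaled to avoid dividing by zero.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-column mean and standard deviation from the training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # constant columns: divide by 1 instead of 0
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale any set of rows with statistics learned elsewhere."""
    return (X - mean) / std

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(10, 4)).astype(float)
X[:, 0] = 255.0  # a constant "white background" column
X_tr, X_te = X[:7], X[7:]

mean, std = fit_standardizer(X_tr)
X_tr_scaled = apply_standardizer(X_tr, mean, std)
X_te_scaled = apply_standardizer(X_te, mean, std)  # test rows reuse the training statistics
```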



In [144]:
# I will use this function to analyze my results
# Loosely taken from https://stackoverflow.com/questions/31324218/scikit-learn-how-to-obtain-true-positive-true-negative-false-positive-and-fal
# Obviously the Recall is the same as the True Positive rate but I included them both separately because it seemed clearer
def tfpn(y_true, y_pred):
    tp, tn, fp, fn = 0, 0, 0, 0
    for i in range(len(y_true)):
        if y_true[i] == y_pred[i] == '1':
            tp += 1
        elif y_pred[i] == '1' and y_pred[i] != y_true[i]:
            fp += 1
        elif y_pred[i] == y_true[i] == '0':
            tn += 1
        else:
            fn += 1
    print('True positive rate = ', tp/(fn+tp))
    print('False positive rate = ', fp/(tn+fp))
    print('True negative rate = ', tn/(tn+fp))
    print('False negative rate = ', fn/(fn+tp))
    print('Precision = ', tp/(fp+tp))
    print('Recall = ', tp/(tp+fn))
    print('Accuracy = ', (tp+tn)/len(y_true))
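
The function above raises a `ZeroDivisionError` if a denominator is zero, for example when the model never predicts `'1'`. A small sketch of a variant that returns the rates in a dictionary and guards those divisions is given below; `confusion_rates` is a hypothetical name, and returning `nan` for undefined rates is an assumption about the desired behavior.

```python
def confusion_rates(y_true, y_pred, positive='1'):
    """Count TP/TN/FP/FN and return the derived rates as a dict."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1

    def ratio(num, den):
        # Undefined rates (zero denominator) are reported as nan
        return num / den if den else float('nan')

    return {
        'tpr': ratio(tp, tp + fn),        # same quantity as recall
        'fpr': ratio(fp, fp + tn),
        'tnr': ratio(tn, tn + fp),
        'fnr': ratio(fn, fn + tp),
        'precision': ratio(tp, tp + fp),
        'accuracy': ratio(tp + tn, tp + tn + fp + fn),
    }

rates = confusion_rates(['1', '1', '0', '0', '1'], ['1', '0', '0', '1', '1'])
print(rates)
```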

In [153]:
# Basic Logistic Regression
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
train_predictions = logistic_model.predict(X_train)
predictions = logistic_model.predict(X_test)


# Unscaled Data results
# Train data accuracy score: 1.0
# Test data accuracy score: .668
# Scaled Data results
# Train data accuracy score: 1.0
# Test data accuracy score: .694

print("Train data statistics")
tfpn(y_train, train_predictions)
print('    ')
print("Test data statistics")
tfpn(y_test, predictions)

# Obviously we should do well on the data on which we trained our model.
# I didn't expect our logistic model to do very well because we have 57132 dimensions (138*138*3) and only around 500 observations
# With far more dimensions than observations, the model can fit the training data perfectly without learning
# anything that generalizes, so we can't expect a regression model fit this way to work well on new data
# Hopefully many of the dimensions are constant because of white backgrounds but there are still many dimensions.
# I am surprised we achieved almost 70% accuracy for new data.

Train data statistics
True positive rate =  1.0
False positive rate =  0.0
True negative rate =  1.0
False negative rate =  0.0
Precision =  1.0
Recall =  1.0
Accuracy =  1.0
    
Test data statistics
True positive rate =  0.6460176991150443
False positive rate =  0.30120481927710846
True negative rate =  0.6987951807228916
False negative rate =  0.35398230088495575
Precision =  0.7448979591836735
Recall =  0.6460176991150443
Accuracy =  0.6683673469387755
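
The perfect training scores above are what one would expect when features far outnumber observations: each flattened image has 138*138*3 = 57132 values, while the 70% split leaves 456 training rows. The sketch below illustrates the effect on random data with a plain least-squares fit, which is a simpler stand-in for logistic regression and not the model used in this notebook. The labels are unrelated to the features, yet the training fit is exact.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 50  # more features than samples, as with the image data
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples) * 2.0 - 1.0  # arbitrary +/-1 labels, unrelated to X

# Minimum-norm least-squares solution; with n_features > n_samples it can match y exactly
w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_accuracy = np.mean(np.sign(X @ w) == y)
print(train_accuracy)
```

Because the labels here are pure noise, a perfect training score says nothing about how the model would do on new rows.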


In [160]:
# PCA
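for i in [15, 30, 45, 196]: # Try several numbers of components to see which works best
    pca = PCA(n_components = i) # Changing the components
    PCA_train = pca.fit_transform(X_train)
    # Project the test set with the components learned from the training set.
    # Note: the results recorded below came from a run that called pca.fit_transform(X_test) here,
    # which refits PCA on the test set, so the train and test features did not share the same axes.
    PCA_test = pca.transform(X_test)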
    logistic_model.fit(PCA_train, y_train)
    PCA_train_predictions = logistic_model.predict(PCA_train)
    PCA_test_predictions = logistic_model.predict(PCA_test)
    print('Training data results for %s components' % i)
    tfpn(y_train, PCA_train_predictions)
    print('   ')
    print('Test data results for %s components' % i)
    tfpn(y_test, PCA_test_predictions)
    print('   ')


# Scaled Data results (train results, test results) (accuracy)
# n_components = 15: 0.7258771929824561, 0.49489795918367346
# n_components = 30: 0.7850877192982456, 0.4897959183673469
# n_components = 45: 0.793859649122807, 0.5561224489795918
# n_components = 196: 1, 0.5102040816326531

# Unscaled Data results (train results, test results) (accuracy)
# n_components = 15: 0.7149122807017544, 0.6326530612244898
# n_components = 30: 0.7850877192982456, 0.5408163265306123
# n_components = 45: 0.7807017543859649, 0.5408163265306123
# n_components = 196: 1.0, 0.5306122448979592

Training data results for 15 components
True positive rate =  0.6859504132231405
False positive rate =  0.2570093457943925
True negative rate =  0.7429906542056075
False negative rate =  0.3140495867768595
Precision =  0.751131221719457
Recall =  0.6859504132231405
Accuracy =  0.7127192982456141
   
Test data results for 15 components
True positive rate =  0.5486725663716814
False positive rate =  0.26506024096385544
True negative rate =  0.7349397590361446
False negative rate =  0.45132743362831856
Precision =  0.7380952380952381
Recall =  0.5486725663716814
Accuracy =  0.6275510204081632
   
Training data results for 30 components
True positive rate =  0.7396694214876033
False positive rate =  0.16355140186915887
True negative rate =  0.8364485981308412
False negative rate =  0.2603305785123967
Precision =  0.8364485981308412
Recall =  0.7396694214876033
Accuracy =  0.7850877192982456
   
Test data results for 30 components
True positive rate =  0.48672566371681414
False positive rat
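
When PCA feeds a classifier, the test rows are normally projected with the mean and components learned from the training rows, so that both sets are described on the same axes. A minimal numpy-only sketch of that pattern follows. It uses an SVD rather than sklearn's `PCA`, the names `fit_pca` and `project` are hypothetical, and the data is random.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Learn the mean and the top principal directions from the training rows."""
    mean = X_train.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(X, mean, components):
    """Project any rows onto directions that were learned elsewhere."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(30, 8))
X_te = rng.normal(size=(10, 8))

mean, components = fit_pca(X_tr, n_components=3)
Z_tr = project(X_tr, mean, components)
Z_te = project(X_te, mean, components)  # reuses the training mean and directions
print(Z_tr.shape, Z_te.shape)
```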

In [159]:
# LDA
# Code taken from http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)
train, test = clf.transform(X_train), clf.transform(X_test)

logistic_model.fit(train, y_train)
train_predictions = logistic_model.predict(train)
print('train data statistics')
tfpn(y_train, train_predictions)
print('    ')

predictions = logistic_model.predict(test)
print('test data statistics')
tfpn(y_test, predictions)


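# For two classes, transforming the data with LDA and then running logistic regression gives nearly the same
# predictions as LDA's own predict, although the two decision thresholds need not match exactly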
# LDA_test_predictions = clf.predict(X_test)
# LDA_train_predictions = clf.predict(X_train)
# tfpn(y_test, LDA_test_predictions)



train data statistics
True positive rate =  0.9710743801652892
False positive rate =  0.03271028037383177
True negative rate =  0.9672897196261683
False negative rate =  0.028925619834710745
Precision =  0.9710743801652892
Recall =  0.9710743801652892
Accuracy =  0.9692982456140351
    
test data statistics
True positive rate =  0.6814159292035398
False positive rate =  0.3132530120481928
True negative rate =  0.6867469879518072
False negative rate =  0.3185840707964602
Precision =  0.7475728155339806
Recall =  0.6814159292035398
Accuracy =  0.6836734693877551
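
For two classes, LDA reduces each observation to a single number, its position along one discriminant direction. The sketch below computes that direction in the classic Fisher form with numpy on synthetic data. It is an illustration under stated assumptions, not sklearn's implementation: the name `fisher_direction` is hypothetical, the small `ridge` term is added only to keep the scatter matrix invertible, the labels are integers rather than the strings used above, and the midpoint threshold assumes both classes are equally likely.

```python
import numpy as np

def fisher_direction(X, y, ridge=1e-6):
    """Return the two-class Fisher discriminant direction and a midpoint threshold."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class scatter; the ridge term is an assumption to keep it invertible
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    Sw += ridge * np.eye(X.shape[1])
    w = np.linalg.solve(Sw, mu1 - mu0)
    threshold = w @ (mu0 + mu1) / 2.0  # halfway between the projected class means
    return w, threshold

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, size=(50, 2)), rng.normal(loc=4.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

w, threshold = fisher_direction(X, y)
pred = (X @ w > threshold).astype(int)  # classify by which side of the threshold a row falls on
accuracy = np.mean(pred == y)
print(accuracy)
```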


## Results Summary:
### Logistic Regression
|data |TP rate|FP rate|TN rate|FN rate|Precision|Recall|Accuracy|
|-----|-------|-------|-------|-------|---------|------|--------|
|Train|1.0    |0.0    |1.0    |0.0    |1.0      |1.0   |1.0     |
|Test |.646   |.301   |.699   |.354   |.745     |.646  |.668    |


### PCA
|data |number of components|TP rate|FP rate|TN rate|FN rate|Precision|Recall|Accuracy|
|-----|-------------------|-------|-------|-------|-------|---------|------|--------|
|Train|15                 |.661   |.224   |.776   |.339   |.769     |.661  |.715    |
|Train|30                 |.740   |.159   |.841   |.260   |.840     |.740  |.787    |
|Train|45                 |.740   |.159   |.841   |.260   |.840     |.740  |.787    |
|Train|196                |1.0    |0.0    |1.0    |0.0    |1.0      |1.0   |1.0     |
|Test |15                 |.531   |.229   |.771   |.469   |.759     |.531  |.633    |
|Test |30                 |.540   |.337   |.663   |.460   |.685     |.540  |.592    |
|Test |45                 |.522   |.373   |.627   |.478   |.656     |.522  |.566    |
|Test |196                |.504   |.434   |.566   |.496   |.613     |.504  |.531    |

### LDA
|data |TP rate|FP rate|TN rate|FN rate|Precision|Recall|Accuracy|
|-----|-------|-------|-------|-------|---------|------|--------|
|Train|.971   |.033   |.967   |.029   |.971     |.971  |.969    |
|Test |.681   |.313   |.687   |.319   |.748     |.681  |.684    |


## Reflection
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;What did I see in my results? PCA did not perform very well: its best test accuracy (.633, with 15 components) was below both Logistic Regression (.668) and LDA (.684). LDA and Logistic Regression had almost the same performance, with LDA edging out Logistic Regression in Precision, Accuracy, and Recall. What does this mean? Why would we expect these results?  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The major difference between PCA and LDA is that PCA is unsupervised while LDA is supervised. In training, this means that PCA never sees the labels: it looks for the directions along which the images vary the most, whether or not that variation has anything to do with shirts versus jerseys. Unfortunately, given our data set, this didn't work. I think this would work better for data where each variable is always relevant in the same way, like financial data with a high number of dimensions. In our data, which were actually the colors of pixels, a given pixel could represent the color of the shirt in one image but then represent the background in the next. I think this prevented PCA from finding components that capture what makes a shirt or a jersey.  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;LDA is a supervised algorithm. This means that given X, y data, LDA will try to figure out what separates y=1 from y=0. I think this better suited our problem since the individual features are not meaningful on their own. One could look at a lot of financial data and probably draw conclusions because credit score is always credit score. But in this data, pixel #23345 was sometimes the shirt, the lapel, the background, the collar, the design, or anything else. So instead of trying to understand what that pixel meant, it was smarter to just try to differentiate between a shirt and a jersey.  
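
The contrast between the two methods can be illustrated on a toy data set. In the sketch below, which uses synthetic data and has nothing to do with the jersey and shirt images, one feature has a large variance shared by both classes and another has a small variance but carries the class signal. The first principal component follows the high-variance feature and barely separates the classes. The helper `separation` is a hypothetical measure: the gap between class means divided by the pooled within-class standard deviation.

```python
import numpy as np

def separation(scores, labels):
    """Gap between class means in units of the pooled within-class standard deviation."""
    a, b = scores[labels == 0], scores[labels == 1]
    pooled_std = np.sqrt((a.var() + b.var()) / 2.0)
    return abs(a.mean() - b.mean()) / pooled_std

rng = np.random.default_rng(0)
n = 200
labels = np.array([0] * n + [1] * n)
# Feature 0: large variance, same for both classes. Feature 1: small variance, carries the class signal.
X = np.column_stack([
    rng.normal(scale=10.0, size=2 * n),
    rng.normal(scale=0.3, size=2 * n) + np.where(labels == 1, 1.0, -1.0),
])

Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
first_pc = vt[0]            # direction of greatest variance, found without using the labels
pc1_scores = Xc @ first_pc

print(first_pc)
print(separation(pc1_scores, labels), separation(X[:, 1], labels))
```

A supervised method that looks at the labels would favor the second feature, which is the behavior the paragraph above attributes to LDA.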
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;I also think my selection of photos with a white background during data cleaning may have contributed. PCA saw the same white backgrounds in both shirts and jerseys, so I think it would have treated them as part of both. This may have made it harder to tell shirts and jerseys apart because they all looked relatively similar given the same backgrounds. This choice probably helped LDA, though. If all the backgrounds were white, then the background did not differentiate shirts from jerseys, so LDA would probably not focus on it. Instead, it would probably focus on the center of the image, where the shirt or jersey hopefully was. I think Logistic Regression would take a similar approach to LDA, which is probably why it also outperformed PCA.  