# Session 1
## Case Study 1
### Lab

### Data Source
We use the credit card data set for this problem.

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML

Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for __Unbalanced Classification__. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

### Objective
To apply the kNN classifier to a highly unbalanced dataset and explore the consequences of that imbalance.

#### Data Set Information:
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions.
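
To make the degree of imbalance concrete, the counts quoted above can be turned into percentages. The sketch below is only arithmetic on those two numbers; the "always genuine" baseline is a hypothetical classifier introduced for illustration, and the figures refer to the full dataset, not necessarily to the smaller `10kcc.csv` file used in this lab.

```python
# Class counts quoted in the dataset description above (full dataset).
n_total = 284_807
n_fraud = 492
n_genuine = n_total - n_fraud

# Share of fraudulent transactions, as a percentage.
fraud_rate = n_fraud / n_total * 100

# Accuracy of a hypothetical baseline that labels every transaction as genuine.
baseline_accuracy = n_genuine / n_total * 100

print(f"Fraud rate: {fraud_rate:.3f}%")
print(f"Always-genuine baseline accuracy: {baseline_accuracy:.3f}%")
```

Frauds make up roughly 0.17% of the transactions, so a classifier that never predicts fraud would still be about 99.83% accurate. This is worth keeping in mind when reading the accuracy figures later in the lab.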

### Data Attributes
Due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA (to be discussed in future sessions); the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.

### Predicted attribute
Class: Fraud or Genuine 

In [1]:
import pandas as pd
pd.read_csv("../../Datasets/10kcc.csv")

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
5,2,-0.425966,0.960523,1.141109,-0.168252,0.420987,-0.029728,0.476201,0.260314,-0.568671,...,-0.208254,-0.559825,-0.026398,-0.371427,-0.232794,0.105915,0.253844,0.081080,3.67,0
6,4,1.229658,0.141004,0.045371,1.202613,0.191881,0.272708,-0.005159,0.081213,0.464960,...,-0.167716,-0.270710,-0.154104,-0.780055,0.750137,-0.257237,0.034507,0.005168,4.99,0
7,7,-0.644269,1.417964,1.074380,-0.492199,0.948934,0.428118,1.120631,-3.807864,0.615375,...,1.943465,-1.015455,0.057504,-0.649709,-0.415267,-0.051634,-1.206921,-1.085339,40.80,0
8,7,-0.894286,0.286157,-0.113192,-0.271526,2.669599,3.721818,0.370145,0.851084,-0.392048,...,-0.073425,-0.268092,-0.204233,1.011592,0.373205,-0.384157,0.011747,0.142404,93.20,0
9,9,-0.338262,1.119593,1.044367,-0.222187,0.499361,-0.246761,0.651583,0.069539,-0.736727,...,-0.246914,-0.633753,-0.120794,-0.385050,-0.069733,0.094199,0.246219,0.083076,3.68,0


In [2]:
import math
import collections
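def dist(a, b):
    # Euclidean distance over the first len(a) coordinates.
    # Any extra trailing fields in b (e.g. a label) are ignored.
    sqSum = 0
    for i in range(len(a)):
        sqSum += (a[i] - b[i]) ** 2
    return math.sqrt(sqSum)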
# ------------------------------------------------ #
# We are assuming that the label is the last field #
# If not, munge the data to make it so!            #
# ------------------------------------------------ #
def kNN(k, train, given):
    distances = []
    for t in train:
        distances.append((dist(t[:-1], given), t[-1]))
    distances.sort()
    return distances[:k]

def kNN_classify(k, train, given):
    tally = collections.Counter()
    for nn in kNN(k, train, given):
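        # Count the label as a whole; Counter.update() would iterate over
        # the characters of a string label and fail on a numeric one.
        tally[nn[-1]] += 1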
    return tally.most_common(1)[0]
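
Before running the classifier on the credit card data, it can help to check it on a tiny example. The sketch below is a compact, self-contained restatement of the three functions above (same behaviour, shorter code) applied to a hypothetical toy training set; the points and labels are made up for illustration only.

```python
import collections
import math

def dist(a, b):
    # Euclidean distance over the paired coordinates of a and b.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kNN(k, train, given):
    # The label is assumed to be the last field of each training row.
    return sorted((dist(t[:-1], given), t[-1]) for t in train)[:k]

def kNN_classify(k, train, given):
    # Majority vote among the k nearest neighbours; returns (label, votes).
    tally = collections.Counter(label for _, label in kNN(k, train, given))
    return tally.most_common(1)[0]

# Hypothetical toy data: two features followed by a string label.
toy_train = [
    [0.0, 0.0, '0'],
    [0.1, 0.2, '0'],
    [0.2, 0.1, '0'],
    [5.0, 5.0, '1'],
    [5.1, 4.9, '1'],
]

print(kNN_classify(3, toy_train, [0.1, 0.1]))  # all three neighbours are '0'
print(kNN_classify(3, toy_train, [4.8, 5.2]))  # two '1' neighbours outvote one '0'
```

The second query shows the voting at work: only two training points carry the label '1', so with k = 3 one '0' point is always among the neighbours, and the minority class wins by just 2 votes to 1.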

**Exercise 1** :: Split the dataset into training and test sets. Use K = 7 and Euclidean distance. Find the accuracy (as a percentage) on the test data.

In [3]:
import random

# Read the CSV without a header row, then drop the first row (the column names).
# Every value is kept as a string at this point.
creditcardData = pd.read_csv("../../Datasets/10kcc.csv", header=None, low_memory=False).values[1:]

# Split data set into training/test in ratio 80/20
TRAIN_TEST_RATIO = 0.8
picker = list(range(creditcardData.shape[0]))
random.shuffle(picker)       ### randomly shuffle the data
trainMax = int(len(picker) * TRAIN_TEST_RATIO)
train = []
test = []

# Convert the feature fields to float and keep the label (a string) as the last field.
# Note: the first field is the saved row index ("Unnamed: 0") and is treated as a feature here.
for pick in picker[:trainMax]:       ### first 80% of the shuffled indices form the training set
    trainData = list(map(float, creditcardData[pick][:-1]))
    trainData.append(creditcardData[pick][-1])
    train.append(trainData)
for pick in picker[trainMax:]:       ### remaining 20% form the test set
    testData = list(map(float, creditcardData[pick][:-1]))
    testData.append(creditcardData[pick][-1])
    test.append(testData)

results = []
for t in test:
    # t still carries its label as the last field; dist() only reads the feature positions
    results.append(kNN_classify(7, train, t)[0] == t[-1])
correctPredicted = results.count(True)
print(correctPredicted, "are correct")
totalTestData = len(test)
accuracy = (correctPredicted / totalTestData) * 100
print('The accuracy is ' + str(accuracy))

1989 are correct
The accuracy is 99.45
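
Since the objective is to explore what imbalance does to the evaluation, it may help to look beyond accuracy at how many frauds are actually caught. The sketch below counts true/false positives and negatives and derives the recall on the fraud class. The helper `confusion_counts` and the example labels are hypothetical and are not the lab's results: they describe an imagined test set of 2000 rows with 8 frauds and a classifier that predicts "genuine" every time.

```python
def confusion_counts(actual, predicted, positive='1'):
    # Count true positives, false positives, false negatives and true negatives,
    # treating `positive` (fraud) as the class of interest.
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical illustration only: 8 frauds among 2000 test rows,
# and a classifier that labels everything as genuine ('0').
actual = ['1'] * 8 + ['0'] * 1992
predicted = ['0'] * 2000

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual) * 100
# Recall: share of actual frauds that were predicted as fraud.
recall = tp / (tp + fn) * 100 if (tp + fn) else float('nan')

print(f"accuracy = {accuracy:.2f}%, fraud recall = {recall:.2f}%")
```

In this made-up case the accuracy is 99.60% while not a single fraud is detected. With the variables from the cell above, `actual` could be built as `[t[-1] for t in test]` and `predicted` as `[kNN_classify(7, train, t)[0] for t in test]`.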


**Exercise 2** :: Repeat the above (creating random partitions and evaluating the performance) 5 times. 

In [10]:
import random

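# Read the CSV without a header row, then drop the first row (the column names).
# The path matches the one used in the earlier cells.
creditcardData = pd.read_csv("../../Datasets/10kcc.csv", header=None, low_memory=False).values[1:]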

TRAIN_TEST_RATIO = 0.8     # split into training/test in ratio 80/20
accuracyList = []
avgAccuracy = 0

for j in range(5):
    # Draw a fresh random 80/20 partition on every repetition
    picker = list(range(creditcardData.shape[0]))
    random.shuffle(picker)       ### randomly shuffle the data
    trainMax = int(len(picker) * TRAIN_TEST_RATIO)
    train = []
    test = []

    # Convert the feature fields to float and keep the label (a string) as the last field.
    for pick in picker[:trainMax]:       ### first 80% of the shuffled indices form the training set
        trainData = list(map(float, creditcardData[pick][:-1]))
        trainData.append(creditcardData[pick][-1])
        train.append(trainData)
    for pick in picker[trainMax:]:       ### remaining 20% form the test set
        testData = list(map(float, creditcardData[pick][:-1]))
        testData.append(creditcardData[pick][-1])
        test.append(testData)

    results = []
    for t in test:
        results.append(kNN_classify(7, train, t)[0] == t[-1])

    correctPredicted = results.count(True)
    print(correctPredicted, "are correct")
    totalTestData = len(test)
    accuracy = (correctPredicted / totalTestData) * 100
    print('The accuracy is ' + str(accuracy))
    accuracyList.append(accuracy)
    avgAccuracy += accuracy
    
avgAccuracy = avgAccuracy / len(accuracyList)
print('The average accuracy is ' + str(avgAccuracy))

1994 are correct
The accuracy is 99.7
1992 are correct
The accuracy is 99.6
1990 are correct
The accuracy is 99.5
1995 are correct
The accuracy is 99.75
1991 are correct
The accuracy is 99.55000000000001
The average accuracy is 99.62
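
Alongside the average, the spread across the five random partitions shows how stable the estimate is. The sketch below uses the standard library `statistics` module on the five accuracies printed above (copied by hand and rounded to two decimals); the variable names are illustrative.

```python
import statistics

# Accuracies (in percent) from the five runs above, rounded to two decimals.
accuracies = [99.7, 99.6, 99.5, 99.75, 99.55]

mean_acc = statistics.mean(accuracies)
std_acc = statistics.stdev(accuracies)   # sample standard deviation (n - 1 in the denominator)

print(f"mean = {mean_acc:.2f}%, sample std dev = {std_acc:.2f}%")
```

The runs differ by only about a tenth of a percentage point. On data this unbalanced, such a consistently high accuracy is expected and says little on its own about how many frauds are detected.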
