
For this case study, you will perform a classification task on a WiFi dataset, and also explore the question, "Is more data useful for a classification task?"

The dataset you will use can be found at: https://archive.ics.uci.edu/ml/datasets/ujiindoorloc

**\[Step 1\]** When you examine the dataset, you will find a training set and a validation set. Use them to build your classification model. You will need to decide which columns serve as features and which as targets. You can also engineer the features and targets if necessary.
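
As a rough illustration of Step 1, the sketch below separates features from a target on a tiny made-up table. It assumes that the signal-strength columns are the ones whose names start with `WAP` and that the location label is the building/floor pair (`BUILDINGID`, `FLOOR`), which is only one of several reasonable choices. The sample values and the `B0_F3` label format are invented for the example.

```python
import pandas as pd

# Hypothetical miniature of the training table; values are invented for illustration.
df = pd.DataFrame({
    "WAP001": [100, -60, -75, 100],
    "WAP002": [-80, 100, 100, -55],
    "BUILDINGID": [0, 0, 1, 2],
    "FLOOR": [0, 3, 1, 4],
})

# Features: every signal-strength column (assumed to be named "WAP...").
feature_cols = [c for c in df.columns if c.startswith("WAP")]
X = df[feature_cols]

# Target: one string label per (building, floor) pair, e.g. "B2_F4".
y = "B" + df["BUILDINGID"].astype(str) + "_F" + df["FLOOR"].astype(str)

print(X.shape)
print(y.tolist())
```

A string label like this can be passed to scikit-learn classifiers directly; an integer code per pair would work just as well.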

**\[Step 2\]** Which algorithm should you use for your model? Refer to the scikit-learn cheat sheet (https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) and try three algorithms. Some suggestions are LinearSVC, Logistic Regression, KNN classifier, SVC, and Random Forest (as an example of ensemble learning). Perform one experiment with each algorithm and record its performance. Note which model performs best.
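
One possible way to organize the comparison in Step 2 is to loop over a dictionary of candidate models and score each on the same validation split. The sketch below is only an outline: it uses synthetic data from `make_classification` as a stand-in for the WiFi data (an assumption, so that it runs without the CSV files), and the choice of these three classifiers and of accuracy as the metric is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real data (assumption for this sketch).
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Three candidate algorithms, all with default settings except where noted.
models = {
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "LinearSVC": LinearSVC(dual=False),
}

# Fit each model once and record its validation accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_val, model.predict(X_val))

best = max(scores, key=scores.get)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
print("best:", best)
```

Keeping the models in one dictionary makes it easy to swap in another algorithm without touching the evaluation loop.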

**\[Step 3\]** Once the previous step is done, use the best performing model to check whether more data is useful for a classification task. Randomly select 20% of the training samples, keeping the size of the validation set the same, and note the performance. Then repeat with 40%, 60%, 80% and 100% of the training samples. Perform three experiments for each selection: three at 20%, three at 40%, and so on. Average the three results for each selection and plot the averages using a chart of your choice.
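
The repeated-sampling procedure in Step 3 could be sketched as a pair of nested loops: the outer loop walks through the training-set percentages and the inner loop draws three random subsets. The example below is a minimal outline under assumptions: it uses synthetic data and a KNN classifier purely so that it runs quickly and without the CSV files, and names such as `percentages` and `avg_accuracy` are made up for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data (assumption for this sketch).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

percentages = [20, 40, 60, 80, 100]
n_repeats = 3
rng = np.random.default_rng(0)  # seeded so the sketch is repeatable
n_train = len(X_train)

avg_accuracy = {}
for pct in percentages:
    accs = []
    for _ in range(n_repeats):
        # Draw a random subset of the training rows without replacement.
        idx = rng.choice(n_train, size=n_train * pct // 100, replace=False)
        model = KNeighborsClassifier().fit(X_train[idx], y_train[idx])
        # The validation set is never subsampled.
        accs.append(accuracy_score(y_val, model.predict(X_val)))
    avg_accuracy[pct] = float(np.mean(accs))

for pct, acc in avg_accuracy.items():
    print(f"{pct}% of training data: mean accuracy {acc:.3f}")
```

The resulting dictionary can be passed to any plotting call, for example with the percentages on the x-axis and the mean accuracies on the y-axis.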

**\[Step 4\]** Publish your findings in presentation slides. As in case study 1, three of you will be randomly chosen to present your work in front of the class. The slides should inform the audience about:

* the objective of the case study
* the data (features and targets)
* things you have done (e.g. why you selected a specific classification model)
* your findings.


**Things to note**:

* **Type of task**: classification
* **Features**: you choose.
* **Feature engineering**: You are welcome to do so.
* **Target**: Use a combination of features to learn from and identify the location. Ignore the SPACEID column.

* In some cases, Normalization may result in reduced accuracy.
* You must write enough comments so that anybody with some programming knowledge can understand your code.
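
For the note on normalization, one possible rescaling of the signal readings is sketched below. It assumes that the value 100 marks an undetected access point and that -105 is an acceptable stand-in for "weaker than any real reading", mirroring the choice made in the notebook code in this document; the helper name `scale_rssi` and the demo values are hypothetical. Whether this helps or hurts accuracy has to be checked per model.

```python
import pandas as pd

NOT_DETECTED = 100   # placeholder used for "access point not detected" (assumed)
WEAKEST_DBM = -105   # assumed floor for the weakest possible signal

def scale_rssi(signals: pd.DataFrame) -> pd.DataFrame:
    """Map readings to [0, 1]: 0 = undetected/weakest, 1 = strongest (0 dBm)."""
    floored = signals.replace(NOT_DETECTED, WEAKEST_DBM)
    return (floored - WEAKEST_DBM) / (0 - WEAKEST_DBM)

# Invented readings for two access points.
demo = pd.DataFrame({"WAP001": [100, -105, 0], "WAP002": [-42, 100, -84]})
scaled = scale_rssi(demo)
print(scaled)
```

Fitting each model once on the raw readings and once on the scaled ones, as the notebook does, is a simple way to see which version performs better.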

**Grading Criteria**:

* [15 + 15] Data set preparation: Choosing your $X$ and $y$. Feature Engineering.
* [15 + 15 + 15] Three experiments using three algorithms.  
* [15] Observing the effects of more data using five sets of random samples of different sizes from the training set. 
* [10] Presentation slides

**What to submit**:

Put the Jupyter Notebook file and the .csv file in a folder. Then convert your presentation slides into a PDF file and put it in the same folder. Zip the folder; the resulting file should have the extension .zip and be named firstname_lastname_casestudy_2.zip. Upload the .zip file to Canvas.

In [None]:
import pandas as pd
from sklearn import svm
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report

#import training and validation datasets
df_train = pd.read_csv('./trainingData.csv')
df_val = pd.read_csv('./validationData.csv')

#drop columns I will not be using as features. 
df_train = df_train.drop(columns = ['SPACEID', 'PHONEID', 'TIMESTAMP', 'USERID', 'RELATIVEPOSITION', 'LONGITUDE', 'LATITUDE'])
df_val = df_val.drop(columns = ['SPACEID', 'PHONEID', 'TIMESTAMP', 'USERID', 'RELATIVEPOSITION', 'LONGITUDE', 'LATITUDE'])

In [None]:
#confirm no missing values
print(df_train.isnull().values.any())
print(df_val.isnull().values.any())

#Find out how many combinations of buildingid and floor there are
print(df_train.groupby(['BUILDINGID','FLOOR']).size())

In [None]:
#Create a single target variable
#Assign a value to each possible building/floor combination and add it to a single list of location identifiers

#Lookup table: each (building, floor) pair present in the data gets one integer label
location_ids = {
    (0, 0): 0, (0, 1): 1, (0, 2): 2, (0, 3): 3,
    (1, 0): 4, (1, 1): 5, (1, 2): 6, (1, 3): 7,
    (2, 0): 8, (2, 1): 9, (2, 2): 10, (2, 3): 11, (2, 4): 12,
}

#training set: look up the label for every row
#an unexpected pair raises a KeyError instead of being skipped silently,
#so the label list always has the same length as the dataframe
ft = df_train['FLOOR'].tolist()
bt = df_train['BUILDINGID'].tolist()
loct = [location_ids[(b, f)] for b, f in zip(bt, ft)]

#do the same for the validation set
fv = df_val['FLOOR'].tolist()
bv = df_val['BUILDINGID'].tolist()
locv = [location_ids[(b, f)] for b, f in zip(bv, fv)]


In [None]:
#drop floor and building ID feature columns and insert the new combined location feature column into the dataframe
df_train = df_train.drop(columns = ['FLOOR', 'BUILDINGID'])
df_val = df_val.drop(columns = ['FLOOR', 'BUILDINGID'])

df_train['LOCATION'] = loct
df_val['LOCATION'] = locv

In [None]:
#drop columns that have zero variance across all samples, i.e. the same value for every sample in the training set
n = df_train.nunique(axis = 0)
dropcol = n[n == 1].index
df_train = df_train.drop(dropcol, axis = 1)

#drop the same columns from the validation set
df_val = df_val.drop(dropcol, axis = 1)

df_train.head()

In [None]:
#assign the WAP columns as features and the location column as the target
xtrain = df_train.loc[:, 'WAP001' : 'WAP519']
ytrain = df_train.loc[:, 'LOCATION']

xval = df_val.loc[:, 'WAP001' : 'WAP519']
yval = df_val.loc[:, 'LOCATION']

In [None]:
#Attempt to fit model without normalizing the data
#this was the first model I tried but after trying a fourth, this one had the least accuracy.
#I kept the code here anyway.
#lsvc = LinearSVC(dual = False)
#lsvc.fit(xtrain, ytrain)
#ylsvc = lsvc.predict(xval)
#accuracy_score(yval, ylsvc)

In [None]:
#Attempt to fit model without normalizing the data
knn = KNeighborsClassifier()
knn.fit(xtrain, ytrain)
yknn = knn.predict(xval)
accuracy_score(yval, yknn)

In [None]:
#Attempt to fit model without normalizing the data
rf = RandomForestClassifier()
rf.fit(xtrain, ytrain)
yrf = rf.predict(xval)
accuracy_score(yval, yrf)

In [None]:
#Attempt to fit model without normalizing the data
svc = svm.SVC()
svc.fit(xtrain, ytrain)
ysvc = svc.predict(xval)
accuracy_score(yval, ysvc)

In [None]:
#normalize the data
#Confirm the maximum and minimum value of the selected features and if there are values measured at 0, i.e. the strongest signal
print(xtrain.min().min())
print(xtrain.max().max())
strongest = 0 in xtrain.values
print(strongest)
xtrain.describe()

#I replaced the undetected WAP values with -105 so they would represent the weakest signal.
#Then I normalized the data bringing the values between 0 and 1 with 0 being the weakest and 1 being the strongest.
xtrain = xtrain.replace(100, -105)
xtrain = 1 - (xtrain/-105)

xval = xval.replace(100, -105)
xval = 1 - (xval/-105)

xtrain.describe()

In [None]:
#this was the first model I tried but after trying a fourth, this one had the least accuracy.
#I kept the code here anyway.
#lsvc = LinearSVC(dual = False)
#lsvc.fit(xtrain, ytrain)
#ylsvc = lsvc.predict(xval)
#accuracy_score(yval, ylsvc)

In [None]:
knn = KNeighborsClassifier()
knn.fit(xtrain, ytrain)
yknn = knn.predict(xval)
accuracy_score(yval, yknn)

In [None]:
rf = RandomForestClassifier()
rf.fit(xtrain, ytrain)
yrf = rf.predict(xval)
accuracy_score(yval, yrf)

In [None]:
svc = svm.SVC()
svc.fit(xtrain, ytrain)
ysvc = svc.predict(xval)
a100 = accuracy_score(yval, ysvc)
print(a100)

In [None]:
#SVC was the best performing model
#visualize the results
cm = confusion_matrix(yval, ysvc)
cmp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels= svc.classes_)
cmp.plot()
plt.show()
print(classification_report(yval,ysvc))

In [None]:
#I normalized the features in xtrain, so I copy the scaled columns back into df_train.
#That way df_train can be sampled to build the smaller training sets.
#Assigning whole columns replaces the integer columns with the scaled float values.
df_train[xtrain.columns] = xtrain

#getting three different sets of 20% of the training set
train20_1 = df_train.sample(n = int(.2*len(df_train)))
train20_2 = df_train.sample(n = int(.2*len(df_train)))
train20_3 = df_train.sample(n = int(.2*len(df_train)))
x20_1 = train20_1.loc[:,'WAP001':'WAP519']
y20_1 = train20_1.loc[:,'LOCATION']
x20_2 = train20_2.loc[:,'WAP001':'WAP519']
y20_2 = train20_2.loc[:,'LOCATION']
x20_3 = train20_3.loc[:,'WAP001':'WAP519']
y20_3 = train20_3.loc[:,'LOCATION']

#run model on each set
svc.fit(x20_1, y20_1)
predict20_1 = svc.predict(xval)
a20_1 = accuracy_score(yval, predict20_1)
print(a20_1)
svc.fit(x20_2, y20_2)
predict20_2 = svc.predict(xval)
a20_2 = accuracy_score(yval, predict20_2)
print(a20_2)
svc.fit(x20_3, y20_3)
predict20_3 = svc.predict(xval)
a20_3 = accuracy_score(yval, predict20_3)
print(a20_3)

#average accuracy of the three experiments
a20 = (a20_1 + a20_2 + a20_3)/3
print(a20)

In [None]:
#getting three different sets of 40% of the training set
train40_1 = df_train.sample(n = int(.4*len(df_train)))
train40_2 = df_train.sample(n = int(.4*len(df_train)))
train40_3 = df_train.sample(n = int(.4*len(df_train)))
x40_1 = train40_1.loc[:,'WAP001':'WAP519']
y40_1 = train40_1.loc[:,'LOCATION']
x40_2 = train40_2.loc[:,'WAP001':'WAP519']
y40_2 = train40_2.loc[:,'LOCATION']
x40_3 = train40_3.loc[:,'WAP001':'WAP519']
y40_3 = train40_3.loc[:,'LOCATION']

#run model on each set
svc.fit(x40_1, y40_1)
predict40_1 = svc.predict(xval)
a40_1 = accuracy_score(yval, predict40_1)
print(a40_1)
svc.fit(x40_2, y40_2)
predict40_2 = svc.predict(xval)
a40_2 = accuracy_score(yval, predict40_2)
print(a40_2)
svc.fit(x40_3, y40_3)
predict40_3 = svc.predict(xval)
a40_3 = accuracy_score(yval, predict40_3)
print(a40_3)

#average accuracy of the three experiments
a40 = (a40_1 + a40_2 + a40_3)/3
print(a40)

In [None]:
#getting three different sets of 60% of the training set
train60_1 = df_train.sample(n = int(.6*len(df_train)))
train60_2 = df_train.sample(n = int(.6*len(df_train)))
train60_3 = df_train.sample(n = int(.6*len(df_train)))
x60_1 = train60_1.loc[:,'WAP001':'WAP519']
y60_1 = train60_1.loc[:,'LOCATION']
x60_2 = train60_2.loc[:,'WAP001':'WAP519']
y60_2 = train60_2.loc[:,'LOCATION']
x60_3 = train60_3.loc[:,'WAP001':'WAP519']
y60_3 = train60_3.loc[:,'LOCATION']

#run model on each set
svc.fit(x60_1, y60_1)
predict60_1 = svc.predict(xval)
a60_1 = accuracy_score(yval, predict60_1)
print(a60_1)
svc.fit(x60_2, y60_2)
predict60_2 = svc.predict(xval)
a60_2 = accuracy_score(yval, predict60_2)
print(a60_2)
svc.fit(x60_3, y60_3)
predict60_3 = svc.predict(xval)
a60_3 = accuracy_score(yval, predict60_3)
print(a60_3)

#average accuracy of the three experiments
a60 = (a60_1 + a60_2 + a60_3)/3
print(a60)

In [None]:
#getting three different sets of 80% of the training set
train80_1 = df_train.sample(n = int(.8*len(df_train)))
train80_2 = df_train.sample(n = int(.8*len(df_train)))
train80_3 = df_train.sample(n = int(.8*len(df_train)))
x80_1 = train80_1.loc[:,'WAP001':'WAP519']
y80_1 = train80_1.loc[:,'LOCATION']
x80_2 = train80_2.loc[:,'WAP001':'WAP519']
y80_2 = train80_2.loc[:,'LOCATION']
x80_3 = train80_3.loc[:,'WAP001':'WAP519']
y80_3 = train80_3.loc[:,'LOCATION']

#run model on each set
svc.fit(x80_1, y80_1)
predict80_1 = svc.predict(xval)
a80_1 = accuracy_score(yval, predict80_1)
print(a80_1)
svc.fit(x80_2, y80_2)
predict80_2 = svc.predict(xval)
a80_2 = accuracy_score(yval, predict80_2)
print(a80_2)
svc.fit(x80_3, y80_3)
predict80_3 = svc.predict(xval)
a80_3 = accuracy_score(yval, predict80_3)
print(a80_3)

#average accuracy of the three experiments
a80 = (a80_1 + a80_2 + a80_3)/3
print(a80)

In [None]:
xaxis = ['20%', '40%', '60%', '80%', '100%']
yaxis = [a20, a40, a60, a80, a100]
plt.scatter(xaxis, yaxis, color = 'pink')
#plt.ylim takes only the lower and upper limits; it has no step argument
plt.ylim(.90, .95)
plt.xlabel('Percentage of Training Set')
plt.ylabel('Average Model Accuracy')
for i in range(len(yaxis)):
    plt.annotate(str(yaxis[i])[0:5],(xaxis[i],yaxis[i]))
plt.show()