<a href="https://colab.research.google.com/github/sndaba/BinaryClassification-with-SVM-inRandPython-/blob/main/task2_random_data_SimisaniNdaba.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 2: Random Data?

## Question

> I ran the following code for a binary classification task w/ an SVM in both R (first sample) and Python (second example).
>
> Given randomly generated data (X) and response (Y), this code performs leave group out cross validation 1000 times. Each entry of Y is therefore the mean of the prediction across CV iterations.
>
> Computing area under the curve should give ~0.5, since X and Y are completely random. However, this is not what we see. Area under the curve is frequently significantly higher than 0.5. The number of rows of X is very small, which can obviously cause problems.
>
> Any idea what could be happening here? I know that I can either increase the number of rows of X or decrease the number of columns to mediate the problem, but I am looking for other issues.

```R
Y=as.factor(rep(c(1,2), times=14))
X=matrix(runif(length(Y)*100), nrow=length(Y))

library(e1071)
library(pROC)

colnames(X)=1:ncol(X)
iter=1000
ansMat=matrix(NA,length(Y),iter)
for(i in seq(iter)){
    #get train

    train=sample(seq(length(Y)),0.5*length(Y))
    if(min(table(Y[train]))==0)
    next

    #test from train
    test=seq(length(Y))[-train]

    #train model
    XX=X[train,]
    YY=Y[train]
    mod=svm(XX,YY,probability=FALSE)
    XXX=X[test,]
    predVec=predict(mod,XXX)
    RFans=attr(predVec,'decision.values')
    ansMat[test,i]=as.numeric(predVec)
}

ans=rowMeans(ansMat,na.rm=TRUE)

r=roc(Y,ans)$auc
print(r)
```

Similarly, when I implement the same thing in Python I get similar results.



```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
    # Get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    if len(np.unique(Y)) == 1:
        continue
    test = np.array([i for i in range(len(Y)) if i not in train])
    # train model
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))
```

0.8367346938775511


## Your answer

### Possible explanation
With X and Y completely random, the AUC equals 0.5 only on average; a single computed value need not be 0.5. The value depends on how the scores of the two classes happen to be distributed in the particular sample.

If the scores of the two classes overlap evenly, the AUC is close to 0.5. If, by chance, the scores of one class cluster above those of the other, the AUC moves away from 0.5.

Therefore, to judge whether an observed AUC is unusual, more needs to be known about the distribution it is drawn from. If the data are assumed to be generated at random from a known probability distribution, statistical methods such as simulation can be used to estimate that distribution.
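As a rough illustration of estimating that distribution by simulation, the sketch below draws independent uniform scores for 14 positive and 14 negative rows many times and records the AUC of each draw. The function names, the number of draws, and the use of plain uniform scores are assumptions made for this example; it does not reproduce the cross-validation procedure from the question.

```python
import random
import statistics


def auc_pairwise(pos_scores, neg_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def null_auc_samples(n_per_class=14, n_draws=2000, seed=0):
    """Draw random scores repeatedly and return the AUC of every draw."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_draws):
        pos = [rng.random() for _ in range(n_per_class)]
        neg = [rng.random() for _ in range(n_per_class)]
        samples.append(auc_pairwise(pos, neg))
    return samples


null_aucs = null_auc_samples()
print("mean AUC for random scores:", round(statistics.mean(null_aucs), 3))
print("standard deviation:", round(statistics.stdev(null_aucs), 3))
print("share of draws with AUC >= 0.7:",
      sum(a >= 0.7 for a in null_aucs) / len(null_aucs))
```

The spread of `null_aucs` gives a reference for how far a single AUC can sit from 0.5 by chance alone with this many rows.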


### More possible explanations
There could be several more reasons why the area under the curve (AUC) is frequently significantly higher than 0.5, despite using random data for X and Y.

One possibility is overfitting.

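In *leave-group-out cross-validation*, the model is trained on a random subset of the rows and tested on the group that was left out. Here each split trains on 14 of the 28 rows and tests on the other 14, and the split is repeated 1000 times. (Testing on a single held-out observation would be leave-one-out cross-validation.) With 100 columns and only 14 training rows, the model can easily overfit: it fits the training set almost perfectly, but what it has memorized is noise. Its test predictions are then driven by accidents of each training split rather than by any real signal, and averaging them over many splits can yield an AUC away from 0.5.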

Another possibility is random chance.

Although X and Y are completely random, there is still a chance that the data will create patterns that can be exploited by the model. With only a small number of observations, it is more likely that the model will find these patterns by chance, leading to a higher AUC.
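To put a number on "small", a standard result for the rank-based AUC can be used, under the assumption (not made in the original question) that the $m$ positive and $n$ negative scores are independent, continuous, and identically distributed. This idealization does not describe the averaged cross-validation predictions exactly; it only indicates the scale of pure chance variation:

$$
\operatorname{Var}\bigl(\widehat{\mathrm{AUC}}\bigr) = \frac{m + n + 1}{12\,m\,n},
\qquad
m = n = 14:\quad \sqrt{\frac{29}{2352}} \approx 0.11
$$

With ten times as many rows ($m = n = 140$), the same formula gives a standard deviation of about 0.035.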

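Finally, the cross-validation or AUC code itself may be flawed, so it is worth checking each step against what it is meant to do. For example, the Python check `len(np.unique(Y)) == 1` inspects the full `Y` rather than `Y[train]`, so it can never skip a one-class training set. In R, pROC's `roc()` chooses the comparison direction automatically by default, so the R code will tend to report an AUC of at least 0.5 even when the scores are anti-correlated with the labels.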
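One way to check the procedure itself is to take the SVM out of it. The sketch below keeps the resampling and averaging steps of the question's code but swaps in a hypothetical stand-in classifier that ignores X entirely and predicts whichever class is more frequent in the training half (ties broken at random). This stand-in is an assumption made for illustration, not the behaviour of `SVC` or `svm()`, and the function names are invented here.

```python
import random


def auc_pairwise(pos_scores, neg_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def averaged_holdout_scores(labels, n_iter=1000, seed=0):
    """Repeat random half splits; average a majority-class prediction per held-out row."""
    rng = random.Random(seed)
    n = len(labels)
    sums = [0.0] * n
    counts = [0] * n
    for _ in range(n_iter):
        train = set(rng.sample(range(n), n // 2))
        n_one = sum(1 for idx in train if labels[idx] == 1)
        n_two = len(train) - n_one
        # Stand-in "model": predict the training majority class, never looking at X.
        if n_one > n_two:
            pred = 1
        elif n_two > n_one:
            pred = 2
        else:
            pred = rng.choice([1, 2])
        for idx in range(n):
            if idx not in train:
                sums[idx] += pred
                counts[idx] += 1
    return [s / c for s, c in zip(sums, counts)]


labels = [1, 2] * 14
scores = averaged_holdout_scores(labels)
# Class 1 is treated as the positive class, as in the question's roc_curve call.
pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 2]
auc = auc_pairwise(pos, neg)
print("AUC with a feature-blind stand-in classifier:", round(auc, 3))
```

Because Y is balanced, a training half with more rows of one class leaves a test half with more rows of the other, so the stand-in's averaged predictions are tied to the labels even though no feature is used. The AUC from this sketch should come out well above 0.5, which suggests that the resampling and averaging scheme alone can move the AUC, independent of the model.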


### Countermeasure
There are several ways to address overfitting, depending on the specific situation and the model being used, including:

- reducing model complexity,
- regularization,
- dropout,
- increasing the amount of training data,
- early stopping, and
- using cross-validation.

After trying these possibilities, I chose to change the leave-group-out cross-validation scheme, because it is already part of the code.

### Alternative Solution

To use **k-fold cross-validation** instead of leave-group-out cross-validation, the body of the `for` loop can be modified. The modified R version below uses 7 folds and the Python version uses 5.

The `folds` variable assigns each row to a fold using the `sample()` function. A second loop, nested inside the iteration loop, goes through each fold, trains the model on the remaining folds, and predicts the held-out fold. Finally, each row's predictions are averaged across iterations and the AUC is computed from those averages.

#### R version
```R
install.packages('e1071')
install.packages('pROC')

library(e1071)
library(pROC)

# k-fold cross-validation instead of leave-group-out
Y <- as.factor(rep(c(1,2), times = 14))
X <- matrix(runif(length(Y)*100), nrow = length(Y))

colnames(X) <- 1:ncol(X)
iter <- 1000
ansMat <- matrix(NA, length(Y), iter)
k <- 7 # number of folds
set.seed(123) # set seed; note that X above was generated before the seed is set
# create folds once: every iteration of the outer loop reuses this same split
folds <- sample(rep(1:k, length.out = length(Y)))

for (i in seq(iter)) {
  for (j in 1:k) {
    # get training and test indices for current fold
    train <- which(folds != j)
    test <- which(folds == j)

    # train model on current fold
    XX <- X[train, ]
    YY <- Y[train]
    mod <- svm(XX, YY, probability = FALSE)

    # make predictions on test set and store results
    XXX <- X[test, ]
    predVec <- predict(mod, XXX)
    RFans <- attr(predVec, 'decision.values')
    ansMat[test, i] <- as.numeric(predVec)
  }
}
ans <- rowMeans(ansMat, na.rm = TRUE)
r <- roc(Y, ans)$auc
print(r)
```
Area under the curve: 0.6429

#### Python version gives a lower AUC

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import roc_auc_score

Y = np.repeat([1, 2], repeats=14).astype(np.int64)
X = np.random.rand(len(Y)*100).reshape((len(Y), 100))

n_iter = 1000
# NaN-filled so that np.nanmean ignores any entry that is never written
ansMat = np.full((len(Y), n_iter), np.nan)
k = 5 # number of folds
np.random.seed(123) # set seed; note that X above was generated before the seed is set
# create folds once, with replacement: fold sizes can be unequal,
# and every iteration of the outer loop reuses this same split
folds = np.random.choice(k, size=len(Y), replace=True)

for i in range(n_iter):
    for j in range(k):
        # get training and test indices for current fold
        train = np.where(folds != j)[0]
        test = np.where(folds == j)[0]

        # train model on current fold
        XX = X[train, :]
        YY = Y[train]
        mod = svm.SVC(probability=False).fit(XX, YY)

        # make predictions on test set and store results
        XXX = X[test, :]
        predVec = mod.predict(XXX)
        ansMat[test, i] = predVec.astype(np.float64)

ans = np.nanmean(ansMat, axis=1)
r = roc_auc_score(Y, ans)
print(r)
```


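0.3214285714285714

Note that `roc_auc_score` treats the larger label (2) as the positive class, whereas the question's Python code passed `pos_label=1` to `roc_curve`. With the same scores, swapping the positive class turns an AUC of a into 1 - a, so 0.32 here corresponds to about 0.68 under the question's convention.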
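The fold assignment can also be redrawn on every repetition and kept balanced per class, so that repetitions differ and no fold lacks a class. The helper below is a minimal sketch of that idea in plain Python; `stratified_folds` is a hypothetical name introduced here, not a function from scikit-learn or e1071.

```python
import random


def stratified_folds(labels, k, rng):
    """Assign each row a fold id in 0..k-1, spreading every class evenly over the folds."""
    folds = [None] * len(labels)
    for cls in sorted(set(labels)):
        rows = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(rows)  # a fresh random order on each call
        for position, row in enumerate(rows):
            folds[row] = position % k
    return folds


labels = [1] * 14 + [2] * 14
rng = random.Random(123)
for repetition in range(3):
    # Redraw the folds inside the repetition loop so that each repetition differs.
    folds = stratified_folds(labels, k=7, rng=rng)
    sizes = [folds.count(f) for f in range(7)]
    print("repetition", repetition, "fold sizes", sizes)
```

With 14 rows per class and 7 folds, each fold holds two rows of each class. scikit-learn provides `RepeatedStratifiedKFold` in `sklearn.model_selection` for the same purpose.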


## Feedback

Was this exercise difficult or not? In either case, briefly describe why.

I feel this task was not difficult, because the possibilities were already in the code: hyperparameters, classifiers, and grid-search techniques could all be changed to get better results.

However, the task could be difficult for someone without experience in predictive modelling. Background knowledge in modelling would help with this task.