### Statistics vs. machine learning

- Statistics: Summarizing and fitting a model to all of your data.

- Machine learning: Making predictions on new (held-out) data.
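
A minimal sketch of this distinction, assuming synthetic data (the arrays and variable names below are illustrative and not part of the demo dataset): the same logistic regression is scored once on the data it was fit to and once on samples it never saw.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic two-class data: the class depends on the first feature plus noise
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# "Statistics" view: fit the model to all of the data and describe that fit
full_model = LogisticRegression().fit(X, y)
fit_acc = full_model.score(X, y)

# "Machine learning" view: fit on 80% and predict the held-out 20%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
heldout_model = LogisticRegression().fit(X_train, y_train)
heldout_acc = heldout_model.score(X_test, y_test)

print(f'accuracy on the data used for fitting: {fit_acc:.3f}')
print(f'accuracy on held-out data: {heldout_acc:.3f}')
```

The held-out score is the one that estimates how the model would behave on new samples.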

### When to use a classifier vs. deep learning
- Classifier (classical machine learning): For smaller or lower-dimensional data. Works with fewer samples.

- Deep learning: For large or high-dimensional data (e.g., images, text). Requires more samples. Generally slower to train, but more flexible and often more accurate when enough data are available.

In [4]:

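import os
import sys

project_name = 'vlad-lab'

# get the current working directory and trim it to the project root
# (assumes the notebook is run from somewhere inside the project folder)
cwd = os.getcwd()
git_dir = cwd.split(project_name)[0] + project_name

# make the project's modules importable
sys.path.append(git_dir)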

import pandas as pd
import numpy as np

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import RidgeClassifierCV, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

import scipy.stats as stats
from glob import glob

In [5]:
def concat_acts(acts_dir):
    """
    Concatenate activation files from a directory and extract labels from filenames.
    
    Parameters:
    acts_dir (str): Directory containing activation .npy files.
    
    Returns:
    all_acts (np.ndarray): Concatenated activations.
    all_labels (np.ndarray): Corresponding labels extracted from filenames.
    """

    # glob all activation files (sorted so the sample order is reproducible)
    act_files = sorted(glob(f'{acts_dir}/*.npy'))
    if not act_files:
        raise FileNotFoundError(f'No .npy files found in {acts_dir}')

    # loop through activation files and concatenate them;
    # also get labels from filenames
    all_acts = []
    all_labels = []
    for act_file in act_files:
        acts = np.load(act_file)
        label = os.path.basename(act_file).split('_')[1]  # assuming label is the second part of filename
        n_samples = acts.shape[0]
        
        all_acts.append(acts)
        all_labels.extend([label] * n_samples)

    all_acts = np.vstack(all_acts)
    all_labels = np.array(all_labels)

    return all_acts, all_labels
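
To make the filename convention concrete, here is a small self-contained sketch that repeats the same loading logic on two temporary files. The file names (`acts_cat_01.npy`, `acts_dog_01.npy`) and the `demo_*` variables are hypothetical; the only assumption carried over from `concat_acts` is that the label is the second underscore-separated part of the file name.

```python
import os
import tempfile
from glob import glob

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    # Two hypothetical activation files: 3 "cat" samples and 2 "dog" samples, 4 features each
    np.save(os.path.join(tmp_dir, 'acts_cat_01.npy'), np.zeros((3, 4)))
    np.save(os.path.join(tmp_dir, 'acts_dog_01.npy'), np.ones((2, 4)))

    demo_acts, demo_labels = [], []
    for act_file in sorted(glob(os.path.join(tmp_dir, '*.npy'))):
        acts = np.load(act_file)
        # label is the second underscore-separated part of the file name
        label = os.path.basename(act_file).split('_')[1]
        demo_acts.append(acts)
        # one label per row (sample) in the file
        demo_labels.extend([label] * acts.shape[0])

    demo_acts = np.vstack(demo_acts)
    demo_labels = np.array(demo_labels)

print(demo_acts.shape)  # (5, 4)
print(demo_labels)
```

Each row of the stacked array gets the label of the file it came from, so the two outputs always have the same length.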

### Common classifiers
- Decision trees: Essentially a set of if-then rules (e.g., if X > 5 and Y < 3, then class A; otherwise class B).
    - Good when features have non-linear relationships with the class labels or when there are relatively few features (<1000). Can overfit if not pruned.
    - Random forests: An ensemble of decision trees trained on random subsets of the data and features. Reduces overfitting compared to a single decision tree.

- Support vector machines (SVMs): Find the hyperplane (a line, in two dimensions) that best separates the classes in feature space.
    - Good for smaller datasets with a clear distinction between classes. Can use kernel functions to handle non-linear boundaries.

- K-nearest neighbors (k-NN): Classify a new sample based on the labels of the k training samples closest to it in feature space.
    - Good for smaller datasets that may not have clear borders, but do have clusters.
    - Nearest centroid: Similar to k-NN with k = 1, except that the single neighbor is the centroid (mean) of each class rather than an individual training sample.

- Logistic regression: Models the probability of a sample belonging to a class using a logistic function.
    - Good for datasets with linear relationships between features and class labels.
    - Ridge classification: Linear regression on the class labels that weights the features and penalizes large coefficients to prevent overfitting.

**Note: these are oversimplifications of each method**
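
Decision trees and random forests are described above but are not among the classifier options used later in this notebook. As a rough sketch of how they could be tried, assuming synthetic data from `make_classification` (the data, hyperparameters, and variable names here are illustrative choices, not settings from the demo):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 300 samples, 20 features, 5 of which carry class information
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

tree_models = {
    # max_depth limits tree growth, a simple guard against overfitting
    'decision tree': DecisionTreeClassifier(max_depth=5, random_state=0),
    # an ensemble of trees fit on bootstrap samples and random feature subsets
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

tree_scores = {}
for name, model in tree_models.items():
    # mean accuracy over 5 cross-validation folds
    tree_scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f'{name}: mean accuracy {tree_scores[name]:.3f}')
```

Tree-based models do not depend on feature scale, which is why this sketch omits the `StandardScaler` step.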


In [6]:
classifier = 'KNN'  # options: 'SVM', 'Ridge', 'NB', 'KNN', 'logistic', 'NC'

if classifier == 'SVM':
    clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
elif classifier == 'Ridge':
    clf = make_pipeline(StandardScaler(), RidgeClassifierCV())
elif classifier == 'NB':
    clf = make_pipeline(StandardScaler(), GaussianNB())
elif classifier == 'KNN':
    clf = make_pipeline(StandardScaler(), KNeighborsClassifier())
elif classifier == 'logistic':
    clf = make_pipeline(StandardScaler(), LogisticRegression())
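elif classifier == 'NC':
    clf = make_pipeline(StandardScaler(), NearestCentroid())
else:
    # fail early instead of leaving clf undefined
    raise ValueError(f'Unknown classifier: {classifier}')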

### Cross-validation
Fit the model on a subset of the data (training set) and evaluate performance on held-out data (validation set). Repeat multiple times with different splits to get an average performance metric.

**Common approaches**
- Leave-one-out cross-validation: Train on N-1 samples and test on the 1 left out. Repeat for all N samples. Good for small datasets (e.g., runs of fMRI data).
- K-fold cross-validation: Split data into K equal parts. Train on K-1 parts and test on the 1 left out. Repeat K times so that each part is tested once. Good for larger datasets (e.g., with K = 5, train on 80% and test on 20%).
    - Stratified k-fold: Like k-fold but ensures each fold has the same class distribution as the overall dataset. Good for imbalanced classes (e.g., each fold samples from every category in proportion to its share of the dataset).
- Shuffle-split: Randomly split data into training and test sets multiple times. Good for large datasets where k-fold may be computationally expensive.
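
The approaches above map onto splitter classes in `sklearn.model_selection`. The sketch below runs each one on a small toy dataset (12 samples with an 8-vs-4 class imbalance, chosen only for illustration) and reports how many splits it produces and what the first test fold looks like.

```python
import numpy as np
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    StratifiedKFold,
    StratifiedShuffleSplit,
)

# Toy data: 12 samples, 2 features, imbalanced classes (8 of class 0, 4 of class 1)
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)

splitters = {
    'leave-one-out': LeaveOneOut(),
    'k-fold (K=4)': KFold(n_splits=4, shuffle=True, random_state=0),
    'stratified k-fold (K=4)': StratifiedKFold(n_splits=4, shuffle=True, random_state=0),
    'stratified shuffle-split': StratifiedShuffleSplit(n_splits=10, test_size=0.25, random_state=0),
}

split_counts = {}
for name, splitter in splitters.items():
    split_counts[name] = splitter.get_n_splits(X, y)
    # look at the first train/test split only
    train_idx, test_idx = next(splitter.split(X, y))
    test_class_counts = np.bincount(y[test_idx], minlength=2)
    print(f'{name}: {split_counts[name]} splits, '
          f'first test fold has {len(test_idx)} samples, class counts {test_class_counts}')
```

With the stratified splitters, every test fold keeps the 2:1 class ratio of the full dataset; plain k-fold gives no such guarantee.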



In [None]:
'''
Train and test on same dataset with cross-validation
'''

models = ['resnet50', 'convnext', 'vit']
# collect one row of results per model
results = []

# stratified shuffle split settings
n_splits = 100
test_size = 0.2

for model in models:
    all_acts, all_labels = concat_acts(f'{git_dir}/demo_data/acts/{model}')

    # decode image labels using stratified shuffle split and the selected classifier (clf)
    sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size)
    accuracies = []
    for train_index, test_index in sss.split(all_acts, all_labels):
        X_train, X_test = all_acts[train_index], all_acts[test_index]
        y_train, y_test = all_labels[train_index], all_labels[test_index]

        clf.fit(X_train, y_train)
        acc = clf.score(X_test, y_test)
        accuracies.append(acc)

    results.append({'model': model, 'acc': np.mean(accuracies)})
    print(f'{model} mean accuracy {np.mean(accuracies)}')

# dataframe of mean accuracy for each model
results = pd.DataFrame(results, columns=['model', 'acc'])


resnet50 mean accuracy 0.0008333333333333333
convnext mean accuracy 0.9822222222222223
vit mean accuracy 0.7311111111111112


In [8]:
'''
Train on one dataset and test on another
'''
train_data_name = ''  # appended directly to the model name; '' loads the base {model} directory
test_data_names = ['natural', 'scramble', 'shape']

models = ['convnext']
for model in models:
    train_acts, train_labels = concat_acts(f'{git_dir}/demo_data/acts/{model}{train_data_name}')
    # fit once on the training set, then evaluate on each test set
    clf.fit(train_acts, train_labels)
    for test_data_name in test_data_names:
        test_acts, test_labels = concat_acts(f'{git_dir}/demo_data/acts/{model}_{test_data_name}')
        acc = clf.score(test_acts, test_labels)
        print(f'{model} test accuracy on {test_data_name}: {acc}')

convnext test accuracy on natural: 0.9555555555555556
convnext test accuracy on scramble: 0.8777777777777778
convnext test accuracy on shape: 0.46111111111111114


In [None]:
accuracies
