# Beyond Your Local Machine: Leveraging HPC Clusters for Big Data ML Training

**LinkedIn**: 

**Medium**: 

This notebook contains the code that accompanies the article above.

Imagine we are working on a social media analysis project. We have two datasets:

1. An unlabeled dataset (**`test_data`**) (https://www.kaggle.com/datasets/sudishbasnet/truthseekertwitterdataset2023?select=Twitter+Analysis.csv) containing roughly 112,000 rows once duplicates are removed, where each row represents a social media user with multiple features.
2. A labeled dataset (**`train_data`**) (https://www.kaggle.com/datasets/danieltreiman/twitter-human-bots-dataset) with roughly 37,000 rows.

Our task is to label each instance in `test_data` with the $k$-nearest neighbors algorithm: for every test row, we find its $k$ closest rows in `train_data` and assign the most common label among them. Here's a simplified Python snippet of what this process might look like:

In [1]:
import numpy as np 
import pandas as pd 


features = ['favourites_count', 'followers_count', 'friends_count', 'statuses_count']

traindata = pd.read_csv("datasets/twitter_human_bots_dataset.csv", 
                        usecols=features+['account_type'],
                        index_col=False
                       )
traindata.shape

(37438, 5)

In [2]:
testdata = pd.read_csv("datasets/Twitter Analysis.csv", 
                       usecols=features+['BotScore'], 
                       index_col=False
                      )
testdata = testdata.drop_duplicates()
testdata.shape

(112566, 5)

In [5]:
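import faiss

def find_neighbour(query_vector, target_dataset, k=30):
    # faiss works on C-contiguous float32 arrays, so convert explicitly
    target = np.ascontiguousarray(target_dataset, dtype=np.float32)
    query = np.ascontiguousarray(query_vector, dtype=np.float32)

    # Exact (brute-force) L2 index; note that it is rebuilt on every call
    index = faiss.IndexFlatL2(target.shape[1])
    index.add(target)

    # Row positions of the k nearest rows for each query vector
    distances, indices = index.search(query, k)

    return indices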
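
To make the search step concrete, here is a minimal, dependency-free sketch of what an exact L2 search such as `faiss.IndexFlatL2` computes: the squared Euclidean distance from the query to every row, followed by selection of the `k` smallest. The helper name `brute_force_knn` and the toy data are assumptions made up for illustration; they are not part of the notebook or of faiss.

```python
import heapq


def brute_force_knn(query, dataset, k):
    """Return (squared_distances, indices) of the k rows closest to `query`."""
    # Squared Euclidean distance from the query to every row of the dataset.
    scored = [
        (sum((q - x) ** 2 for q, x in zip(query, row)), i)
        for i, row in enumerate(dataset)
    ]
    # Keep the k smallest distances, nearest first (ties go to the lower index).
    nearest = heapq.nsmallest(k, scored)
    distances = [d for d, _ in nearest]
    indices = [i for _, i in nearest]
    return distances, indices


# Hypothetical toy data: four "accounts" described by two made-up features.
toy_train = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
toy_query = [0.9, 1.2]

toy_distances, toy_indices = brute_force_knn(toy_query, toy_train, k=2)
print(toy_indices)
```

The work grows with the number of training rows for every single query, which is why this step dominates the runtime once both datasets are large.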

In [6]:
from tqdm.auto import tqdm
import faiss

labels = []
for i, row in tqdm(testdata.iterrows(), total=len(testdata), desc="Processing the test dataset"):
    # ids holds row positions within traindata, shape (1, k)
    ids = find_neighbour(row[features].values.reshape(1, -1), traindata[features])

    # Positional lookup (iloc), then majority vote over the neighbours' labels
    labels.append(traindata["account_type"].iloc[ids[0]].mode()[0])

Processing the test dataset:   0%|          | 0/112566 [00:00<?, ?it/s]
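
The last line of the loop assigns each test row the most frequent `account_type` among its neighbours. As a rough sketch of that voting rule on its own, the hypothetical helper below (`majority_label`, not part of the notebook) counts labels and, on a tie, returns the label that sorts first. That tie-break assumes pandas returns tied modes in sorted order, so that `.mode()[0]` picks the smallest one.

```python
from collections import Counter


def majority_label(neighbour_labels):
    """Return the most frequent label; ties go to the label that sorts first."""
    counts = Counter(neighbour_labels)
    top = max(counts.values())
    # Among the labels that reach the top count, pick the smallest one.
    return min(label for label, count in counts.items() if count == top)


# Hypothetical labels of five neighbours of one test account.
toy_labels = ["bot", "human", "human", "bot", "human"]
print(majority_label(toy_labels))
```

With an even `k` such as the default `k=30`, ties between two classes are possible, so the tie-breaking rule does affect some predictions.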