# Report

### Section 1 - Representation and Preprocessing Explanation

For my Naive Bayes Classifier, I have decided to represent the text in each abstract as a counter, where each key represents a words, and maps to an integer value representing the number of times that word appears in abstracts from that given class. This will be able to account for words appearing multiple times throughout an abstract, which typically reinfornces its importance. Using the frequency method compared to the presence/absence method will smooth out noise by amplifying strong signals and dampening weaker signals. 

In terms of preprocessing the text, I began with turning all the words to lowercase, so, for example, For and for will be allocated to the same frequency counter. I tokenize the raw text by utilising the re library's findall function, where I found sequences of word characters that are bounded between non-word characters. This is used to seperate the words in each paragraph, whilst filtering out spaces, punctuation, indentation, etc. 

### Section 2 - Model and Improvements Explanation

For my Standard NBC implementation, the classifier iterates through all of the training data, and for each instance, updates the appropriate project's word counter. Next, we go through each potential classification, and assign all words that appeared through the training data a probability it will occur, given that its respective class has occured. From here, the classifier has all the required parts to utilize Bayes Theorem to make classifications given test data. For each instance we want to classify, we find the probability for it being any of our 4 classes. The NBC then classifies the instance as the most likely project. 

In order to improve the original implementation, I implemented a variety of techniques, including noise fitration, hyperparameter tuning, and n-grams.

- Firstly, I generated a stopword list using ChatGPT, consisting of noisy words such as 'with' or 'and'. I adjusted my preprocessing function to not acknowledge words in this list. This lead to a very slight improvement of roughly ~0.3%. I also attempted to filter out uncommon words, for example words that appeared less than 3 times throughout the entire training set, but this little to no effect. 

- Next, I adjusted the Laplace smoothing equation to use hyperparameter α, where the equation can be given in the form: (count + α)/(count+α|V|). After tuning α, I found that α=6 gave the best results. I also noticed that the training dataset is extremely unbalanced, with 'W': 2303, 'A': 1732, 'G': 232, 'S': 133. Therefore, instead of scaling the original project probabilities by size, I instead normalized them to add up to 1. Overall, this brought my score up to roughly ~97.5%. 

- I then implemented various forms of n-grams in hopes of capturing contextual dependancies. After testing the classifier with solely bigrams, solely trigrams, and various of variations, I found that the most accurate implementation was unigrams and bigrams (Trigrams led to worse classification). By adding bigrams however, this massively expanded my feature space, leading to a much sparser resulting dataset. To combat this, I had to retune my α hyperparamter value. This makes sense, as the feature dimensionality was increased, and a larger value such as 6 wouldn't pick up on useful signals. After re-tuning, I found that α=0.007 lead to the highest accuracy, at roughly ~97.8%. 

- Lastly, I used my print_top_features() function and my compare() function in order determine the most present/important words in identifying each class. With this information, I created a boosting dataset to give certain words more value for certain projects. For example, the bigram "web_security" has an additional weight of 0.9 towards the "S" classification. After some fine-tuning with these booster, I achieved my highest cross-validation score of 98.3%, and my highest Kaggle submission of 98.1%. 

### Section 3 - Evaluation Procedure Explanation

In order to validate my data, I used a 5-fold cross validation. At the expense of requiring more computation and time, this method allowed for all the given data to be used to train and validate, and has lower variance, as the data will (very likely) be split more evenly as the split occurs 5 times, and the results are averaged. Specifically, I used sklearn library's StratifiedKFold method to randomly split my data into 5 folds. Then, on each fold, my model was trained and validated, and the overall accuracy was found by taking the average accuracy for each of the folds. (The accuracy was found by comparing each of my guesses to the correct answer in the given validation set, and finding correct/all).

### Section 4 - Standard / Improved Training/validation Results

include and explain the training/validation results for the standard and improved Naive Bayes model. You can summarize results using tables (or plots), but all results have to be explained descriptively as well.

From my cross-validation tests, we find: <br><br>

5 Fold Cross-Validation Test on Standard Classifier:

Fold 1 Accuracy: 93.75% |
Fold 2 Accuracy: 94.20% |
Fold 3 Accuracy: 94.66% |
Fold 4 Accuracy: 94.77% |
Fold 5 Accuracy: 96.25% |

Average cross-validated accuracy: 94.73%
<br><br><br>
5 Fold Cross-Validation Test on Improved Classifier:

Fold 1 Accuracy: 98.64% |
Fold 2 Accuracy: 97.50% |
Fold 3 Accuracy: 98.52% |
Fold 4 Accuracy: 98.41% |
Fold 5 Accuracy: 98.41% |

Average cross-validated accuracy is 98.30%

# Code

### Classifiers for Kaggle

### Validation Tests Classifiers

In [53]:
#Standard CLASSIFIER

from collections import Counter, defaultdict
import pandas as pd
import numpy as np
import math
import re

def standard_preprocess(text):
    # Lowercase and keep only words
    text = text.lower()
    words = re.findall(r'\b\w+\b', text)
    return words

def StandardClassifier(X_train, X_val, y_train):
    projs = ["A", "S", "G", "W"]
    word_counts = defaultdict(Counter)
    total_words = {"A":0,"S":0,"G":0,"W":0}
    all_words = Counter()
    num_projs = {"A":0,"S":0,"G":0,"W":0}
    total_instances = len(X_train)

    word_counts["A"] = Counter()
    word_counts["S"] = Counter()
    word_counts["G"] = Counter()
    word_counts["W"] = Counter()

    for i in range(len(X_train)):
        pred, desc = y_train[i], X_train[i]

        pp_desc = standard_preprocess(desc)

        word_counts[pred].update(pp_desc)
        total_words[pred] += len(pp_desc)
        all_words.update(pp_desc)
        num_projs[pred] += 1

    vocab_size = len(all_words)

    class_weights = {proj: 1.0 / num_projs[proj] for proj in projs}
    total_weight = sum(class_weights.values()) 
    proj_probs = []
    for proj in projs:
        weighted_prob = class_weights[proj] / total_weight  # Normalize so that sum is 1
        proj_probs.append(weighted_prob)

    word_probs = defaultdict(Counter)
    word_probs["A"] = Counter()
    word_probs["S"] = Counter()
    word_probs["G"] = Counter()
    word_probs["W"] = Counter()

    #Calculating probabilities for each word/bigram
    for proj in projs:
        focus_dict = word_counts[proj]
        denom = total_words[proj] + total_instances
        
        #Calculating likelihoods with Laplace smoothing
        for word in all_words:
            if word not in focus_dict:
                word_probs[proj][word] = 1/denom
            else:
                word_probs[proj][word] = (focus_dict[word]+1)/denom
        
    #Initialzing classification variables
    temp = np.log(proj_probs)
    class_probs = {}
    for i in range(len(temp)):
        class_probs[projs[i]] = temp[i]
    classifications = []

    #Classify test data
    for desc in X_val:
        pp_desc = standard_preprocess(desc)

        cur_class_probs = class_probs.copy()
        for proj in projs:
            #Unigram Probs
            for word in pp_desc:
                cur_class_probs[proj] += np.log(word_probs[proj].get(word, 1 / denom))

        #find project with highest probability, and make that the prediction
        best_proj = max(cur_class_probs, key=cur_class_probs.get)
        classifications.append(best_proj)
    
    return classifications

In [255]:
#Improved CLASSIFIER

from collections import Counter, defaultdict
import pandas as pd
import numpy as np
import math
import re

stop_words = {
    'a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', "aren't",
    'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by',
    'can', "can't", 'cannot', 'could', "couldn't", 'did', "didn't", 'do', 'does', "doesn't", 'doing',
    "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', "hadn't", 'has', "hasn't",
    'have', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'her', 'here', "here's", 'hers', 'herself',
    'him', 'himself', 'his', 'how', "how's", 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is',
    "isn't", 'it', "it's", 'its', 'itself', "let's", 'me', 'more', 'most', "mustn't", 'my', 'myself', 'no',
    'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'ought', 'our', 'ours', 'ourselves',
    'out', 'over', 'own', 'same', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'so',
    'some', 'such', 'than', 'that', "that's", 'the', 'their', 'theirs', 'them', 'themselves', 'then',
    'there', "there's", 'these', 'they', "they'd", "they'll", "they're", "they've", 'this', 'those',
    'through', 'to', 'too', 'under', 'until', 'up', 'very', 'was', "wasn't", 'we', "we'd", "we'll", "we're",
    "we've", 'were', "weren't", 'what', "what's", 'when', "when's", 'where', "where's", 'which', 'while',
    'who', "who's", 'whom', 'why', "why's", 'with', "won't", 'would', "wouldn't", 'you', "you'd", "you'll",
    "you're", "you've", 'your', 'yours', 'yourself', 'yourselves', 'project', 'the_project', ' '
}

log_boosts = {
    #Artificial Intelligence
    "A": {
        "system": 1.2,
        "learning": 1.2,
        "making": 0.7,
        "data": 0.7,
        "decision": 1,
        "agents": 1,
        "real": 0.5,
        "machine": 1,
        "time": 0.4,
        "using": 0.3,
        "control": 0.5,
        "autonomous": 0.6,
        "machine_learning": 1

    },
    #Security
    "S": {
        "security": 1.2,
        "data": 0.8,
        "system": 0.7,
        "authentication": 1.2,
        "web": 0.5,
        "user": 0.7,
        "network": 2,
        "learning": 0.4,
        "secure": 1.5,
        "based": 0.3,
        "web_security": 0.9,
    },
    #Game
    "G": {
        "game": 2,
        "platform": 0.8,
        "multiplayer": 1.5,
        "real": 0.6,
        "experience": 1,
        "time": 0.5,
        "players": 0.5,
        "user": 0.6,
        "time": 0.5
    },
    # Web Development
    "W": {
        "system" : 0.4,
        "user" : 1.2, 
        "website": 1,
        "frontend": 0.6,
        "backend": 0.6,
        "interface": 0.5,
        "react": 0.5,
        "html": 0.5,
        "css": 0.5,
        "api": 0.6,
        "dynamic": 0.5,
        "framework": 0.5,
    }
}


def print_top_features(word_probs, top_n=10):
    for proj in word_probs:
        print(f"\nTop {top_n} features for class '{proj}':")
        most_common = sorted(word_probs[proj].items(), key=lambda x: x[1], reverse=True)[:top_n]
        for word, prob in most_common:
            print(f"  {word}: {prob:.6f}")

def improved_preprocess(text):
    text = text.lower()
    words = [w for w in re.findall(r'\b\w+\b', text) if w not in stop_words]
    bigrams = []
    bigrams = [f"{words[i-1]}_{words[i]}" for i in range(1, len(words))]
    return words + bigrams


def ImprovedClassifier(alpha, X_train, X_val, y_train):
    projs = ["A", "S", "G", "W"]
    word_counts = defaultdict(Counter)
    total_words = {"A":0,"S":0,"G":0,"W":0}
    all_words = Counter()
    num_projs = {"A":0,"S":0,"G":0,"W":0}
    total_instances = len(X_train)

    word_counts["A"] = Counter()
    word_counts["S"] = Counter()
    word_counts["G"] = Counter()
    word_counts["W"] = Counter()

    for i in range(len(X_train)):
        pred, desc = y_train[i], X_train[i]

        pp_desc = improved_preprocess(desc)

        word_counts[pred].update(pp_desc)
        total_words[pred] += len(pp_desc)
        all_words.update(pp_desc)
        num_projs[pred] += 1

    temp = Counter()
    for word, count in all_words.items():
        if "_" in word:
            if count >= 3:
                temp[word] = count
        else:
            temp[word] = count
    all_words = temp

    vocab_size = len(all_words)

    class_weights = {proj: num_projs[proj] / total_instances for proj in projs}
    total_weight = sum(class_weights.values()) 
    proj_probs = []
    for proj in projs:
        weighted_prob = class_weights[proj] / total_weight  # Normalize so that sum is 1
        proj_probs.append(0)

    word_probs = defaultdict(Counter)
    word_probs["A"] = Counter()
    word_probs["S"] = Counter()
    word_probs["G"] = Counter()
    word_probs["W"] = Counter()

    #Calculating probabilities for each word/bigram
    for proj in projs:
        focus_dict = word_counts[proj]
        denom = total_words[proj] + alpha*vocab_size
        
        #Calculating likelihoods with Laplace smoothing
        for word in all_words:
            if word not in focus_dict:
                word_probs[proj][word] = alpha/denom
            else:
                word_probs[proj][word] = (focus_dict[word]+alpha)/denom

    #Initialzing classification variables
    # temp = np.log(proj_probs)
    class_probs = {}
    for i in range(len(projs)):
        class_probs[projs[i]] = 0
    classifications = []

    counter = 0
    #Classify test data
    for desc in X_val:
        counter += 1
        pp_desc = improved_preprocess(desc)

        cur_class_probs = class_probs.copy()
        for proj in projs:
            for word in pp_desc:
                prob = word_probs[proj].get(word, alpha / total_words[proj] + alpha*vocab_size)
                log_boost = log_boosts.get(proj, {}).get(word, 0.0)
                cur_class_probs[proj] += np.log(prob) + log_boost

        # if counter == 36:
        #     print(cur_class_probs)
        #find project with highest probability, and make that the prediction
        best_proj = max(cur_class_probs, key=cur_class_probs.get)
        classifications.append(best_proj)

    # print_top_features(word_probs, 10)

    return classifications



In [None]:
#Create Kaggle Submissions
def Submission(name):
    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")
    X_train = train["Description"].values
    y_train = train["Class"].values
    X_val = test["Description"].values

    classifications = ImprovedClassifier(3, X_train, X_val, y_train)

    #Format data into CSV file
    csv_cols = []
    for i,c in enumerate(classifications):
        csv_cols.append((i+1, c))
    output_df = pd.DataFrame(csv_cols, columns=["Id", "Class"])
    output_df.to_csv(f"{name}", index=False)

Submission("prediction28.csv")



In [None]:
#5 Fold Cross-Validation Method

from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd

def CheckAccuracy(classifications, y_val):
    correct = 0
    for i in range(len(classifications)):
        if classifications[i] == y_val[i]:
            correct += 1
    return correct/len(classifications)
    
    
def CrossValidate(alpha):
    # print(f"5 Fold Cross-Validation Test on Classifier(s):\n")
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    accuracies = []
    df = pd.read_csv("train.csv")
    X = df["Description"].values
    y = df["Class"].values

    for fold, (train_index, val_index) in enumerate(skf.split(X, y)):
        # Split the data
        X_train, X_val = X[train_index], X[val_index]
        y_train, y_val = y[train_index], y[val_index]
        
        # classifications = StandardClassifier(X_train, X_val, y_train)
        # accuracy = CheckAccuracy(classifications, y_val)
        # accuracies.append(accuracy)
        # print(f"Fold {fold+1} Accuracy for Standard Classifier: {accuracy:.2%}.")

        classifications = ImprovedClassifier(alpha, X_train, X_val, y_train)
        accuracy = CheckAccuracy(classifications, y_val)
        accuracies.append(accuracy)
        print(f"Fold {fold+1} Accuracy for Improved Classifier: {accuracy:.2%}.")

    print(f"\n(a = {alpha}) Average cross-validated accuracy is {np.mean(accuracies):.2%}\n")


for i in [0.006]:
    CrossValidate(i)

In [248]:
#Compare Results Function

def Compare(file): 
    data = pd.read_csv("train.csv")
    descs = data["Description"].values
    comparer = pd.read_csv(f"{file}")
    X1 = comparer["Class"].values
    comparee = pd.read_csv("prediction25.csv")
    X2 = comparee["Class"].values

    for i, val in enumerate(X1):
        if val != X2[i]:
            print(f"({i+1}) {file}: {val}, other: {X2[i]}")
            print(descs[i])
            words = improved_preprocess(descs[i])
            print(Counter(words))

name = "testing.csv"
Submission(name)
Compare(name)


(88) testing.csv: S, other: W
our team implemented an autonomous vehicle control system using machine learning algorithms we focused on developing a system that can navigate a vehicle without human intervention using ai based techniques to achieve this we utilized ros robot operating system for the vehicles control and simulation we trained machine learning models such as deep neural networks using tensorflow and keras for tasks like object detection lane detection and decision making we also used opencv for image processing tasks the input to the system was real time sensor data including data from cameras lidar and radar sensors the machine learning algorithms processed this data to detect obstacles identify lane markings and make decisions for steering acceleration and braking the output of the project was a fully functional autonomous vehicle control system that could effectively navigate through different environments including urban and highway scenarios the system could interpre