<a href="https://colab.research.google.com/github/trcamnguyen/Logistic-Regression/blob/main/Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import math
import pandas as pd
import numpy as np
from collections import Counter
import re
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
import random
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from scipy.special import softmax as sftmx #used to test my function, allowed
# fixing random seed for reproducibility
random.seed(123)
np.random.seed(123)

## Load Raw texts and labels into arrays

First, you need to load the training, development and test sets from their corresponding CSV files (tip: you can use Pandas dataframes).

In [None]:
import kagglehub
# Download Sentiment140
path = kagglehub.dataset_download("kazanova/sentiment140")
file_path = path + "/training.1600000.processed.noemoticon.csv"

cols = ['target','id','date','flag','user','text']
df = pd.read_csv(file_path, encoding='ISO-8859-1', names=cols)

# Keep text + label, map label 4 -> 1
df = df[['text','target']]
df['target'] = df['target'].replace(4,1)

# Take a compact subset of 10,000 rows
df_small = df.sample(n=10000, random_state=42)

# Split into train/dev/test
train_df, test_df = train_test_split(df_small, test_size=0.2, random_state=42, stratify=df_small['target'])
train_df, dev_df = train_test_split(train_df, test_size=0.1, random_state=42, stratify=train_df['target'])

# Format
data_tr_textlist = train_df[['text']].values.tolist()
ArrayTR = train_df[['target']].to_numpy()

data_dev_textlist = dev_df[['text']].values.tolist()
ArrayDev = dev_df[['target']].to_numpy()

data_test_textlist = test_df[['text']].values.tolist()
ArrayTest = test_df[['target']].to_numpy()

print("Binary dataset size:", len(df_small))
print("Train/Dev/Test:", len(data_tr_textlist), len(data_dev_textlist), len(data_test_textlist))


Binary dataset size: 10000
Train/Dev/Test: 7200 800 2000


# Bag-of-Words Representation


To train and test Logistic Regression models, you first need to obtain vector representations for all documents given a vocabulary of features (unigrams, bigrams, trigrams).


## Text Pre-Processing Pipeline

To obtain a vocabulary of features, you should:
- tokenise all texts into a list of unigrams (tip: using a regular expression)
- remove stop words (using the one provided or one of your preference)
- compute bigrams, trigrams given the remaining unigrams
- remove ngrams appearing in fewer than K documents
- use the remaining to create a vocabulary of unigrams, bigrams and trigrams (you can keep top N if you encounter memory issues).


In [None]:
stop_words = ['a','in','on','at','and','or',
              'to', 'the', 'of', 'an', 'by',
              'as', 'is', 'was', 'were', 'been', 'be',
              'are','for', 'this', 'that', 'these', 'those', 'you', 'i',
             'it', 'he', 'she', 'we', 'they', 'will', 'have', 'has',
              'do', 'did', 'can', 'could', 'who', 'which', 'what',
             'his', 'her', 'they', 'them', 'from', 'with', 'its','also','so','there','their','The']

### N-gram extraction from a document

You first need to implement the `extract_ngrams` function. It takes as input:
- `x_raw`: a string corresponding to the raw text of a document
- `ngram_range`: a tuple of two integers denoting the type of ngrams you want to extract, e.g. (1,2) denotes extracting unigrams and bigrams.
- `token_pattern`: a string to be used within a regular expression to extract all tokens. Note that data is already tokenised so you could opt for a simple white space tokenisation.
- `stop_words`: a list of stop words
- `vocab`: a given vocabulary. It should be used to extract specific features.

and returns:

- a list of all extracted features.

See the examples below to see how this function should work.

In [None]:
def extract_ngrams(x_raw, ngram_range=(1,3), token_pattern=r'\b[A-Za-z][A-Za-z]+\b', stop_words= stop_words, vocab=set()):


    # Tokenisation: extract all tokens matching the pattern
    com = re.compile(token_pattern)
    x_raw = com.findall(x_raw)

    # Remove stop words
    x_raw = [word for word in x_raw if word not in stop_words]



    #print(vocab)

#######Return Vocab#####
    #vocab needs to normalised
    vocab1 = str(vocab)
    vocab1 = re.findall(token_pattern,vocab1)
    commachar = ","
    spacechar = ''
    if ((len(vocab1)) != 0): #Check if set is empty
        vocablist = []
        for word in x_raw:
            for word2 in vocab1:
                if ((word == word2) & (word != commachar) & (word != spacechar)& (word2 not in vocablist)): #remove commas & match words in text &vocab
                    vocablist.append(word.replace(" ", "")) ## remove whitespace characters
                   # print(vocablist)

    # n-gram orders to extract, e.g. (1,3) -> [1, 2, 3]
    ngrams_list = []
    noofngrams = list(range(ngram_range[0], ngram_range[1] + 1))

    y =[]

    #Extract Ngrams
    for n in noofngrams:
        if (len(vocab)) == 0:
                for num in range(0, len(x_raw)):
                    ngram = ' '.join(x_raw[num:num + n])
                    #if ngram not in ngrams_list:
                    if ((ngram != '')):
                        ngrams_list.append(ngram)
                        if ngram not in y:
                            y.append(ngram)

        if (len(vocab)) != 0:
                for n in range(0,len(vocab1)):
                    for num in range(0, len(vocablist)):
                        if len(vocablist[num]) != 0:
                                ngram = ' '.join(vocablist[num:num + n])

                                if ((ngram != '')): # ((ngram not in ngrams_list) &
                                    ngrams_list.append(ngram)

                                    if ngram not in y:
                                        y.append(ngram)

    x= ngrams_list
    return x,y # y is unique ngram list, x is list including duplicates for vectorisation

In [None]:
x1,y1 = extract_ngrams("this is a great movie to watch",
               ngram_range=(1,3),
               stop_words=stop_words)
print(y1)

['great', 'movie', 'watch', 'great movie', 'movie watch', 'great movie watch']


Note that it is OK to represent n-grams using lists instead of tuples: e.g. `['great', ['great', 'movie']]`

### Create a vocabulary of n-grams

Then the `get_vocab` function will be used to (1) create a vocabulary of ngrams; (2) count the document frequencies of ngrams; (3) count their raw frequencies. It takes as input:
- `X_raw`: a list of strings each corresponding to the raw text of a document
- `ngram_range`: a tuple of two integers denoting the type of ngrams you want to extract, e.g. (1,2) denotes extracting unigrams and bigrams.
- `token_pattern`: a string to be used within a regular expression to extract all tokens. Note that data is already tokenised so you could opt for a simple white space tokenisation.
- `stop_words`: a list of stop words
- `vocab`: a given vocabulary. It should be used to extract specific features.
- `min_df`: keep ngrams with a minimum document frequency.
- `keep_topN`: keep the top-N most frequent ngrams.

and returns:

- `vocab`: a set of the n-grams that will be used as features.
- `df`: a Counter (or dict) that contains ngrams as keys and their corresponding document frequency as values.
- `ngram_counts`: counts of each ngram in vocab

Hint: it should make use of the `extract_ngrams` function.

In [None]:
def get_vocab(X_raw, ngram_range=(1,3), token_pattern=r'\b[A-Za-z][A-Za-z]+\b', min_df=0, keep_topN=0, stop_words=[]):

    df = Counter()            # document frequency of each n-gram
    ngram_counts = Counter()  # counts used to rank n-grams for the vocabulary

    for x_raw in X_raw:
        # Each document may arrive wrapped in a one-element list, so convert it to a string.
        # extract_ngrams deals with tokenisation, stop words, ngram range and ngram extraction.
        _, unique_ngrams = extract_ngrams(str(x_raw), ngram_range, token_pattern, stop_words)
        for ngram in unique_ngrams:
            df[ngram] += 1
            # Counted once per document, so this currently equals the document frequency
            ngram_counts[ngram] += 1

    # Keep only n-grams whose document frequency is above the minimum
    df = Counter({ngram: count for ngram, count in df.items() if count > min_df})

    # Assign the top-N n-grams to the vocabulary (keep_topN=0 keeps everything)
    top_ngrams = ngram_counts.most_common(keep_topN if keep_topN > 0 else None)
    vocab = {ngram for ngram, _ in top_ngrams}

    return vocab, df, ngram_counts

Now you should use `get_vocab` to create your vocabulary and get document and raw frequencies of n-grams:

In [None]:
vocab, df, ngram_counts = get_vocab(data_tr_textlist, ngram_range=(1,3), keep_topN=5000, stop_words=stop_words)
print(len(vocab))
print()
print(list(vocab)[:100])
print()
print(df.most_common()[:10])

5000

['threads', 'employees', 'quiet', 'HELP', 'us all', 'new iphone', 'blame', 'even though', 'Done', 'He', 'since not', 'saturday', 'meet', 'think thats', 'OF', 'wrenching', 'but idk', 'failing', 'okay', 'sister', 'Carly', 'ew', 'Working', 'Mother', 'sleep but', 'whats next', 'am just', 'ministry', 'hasnt', 'like go', 'massive', 'wonder if', 'my name', 'my computer', 'hi', 'combination', 'Just got', 'Come back', 'work all', 'emailed', 'Heading', 'if got', 'hours', 'know just', 'forever', 'my neck', 'after hour', 'had much', 'boyfriend', 'single', 'ON', 'non', 'myspace', 'ice cream', 'gay', 'home work', 'wana', 'sweet dreams', 'yr', 'emily', 'NEED', 'last night', 'Pink', 'might', 'exams', 'weeks till', 'York', 'magazine', 'drama', 'launched', 'Lines Vines Trying', 'Trying get', 'wondering', 'im sad', 'bad', 'pink', 'Youtube', 'default', 'America', 'heat', 'lol not my', 'Hope', 'Netherlands', 'pony', 'last few', 'heart', 'Have good', 'me either', 'blessed', 'voice', 'club', 'Thinking'

Then, you need to create vocabulary id -> word and word -> id dictionaries for reference:

In [None]:
list_of_vocab = list(vocab)
vocab_id = {i:list_of_vocab[i] for i in range(len(vocab))}

Now you should be able to extract n-grams for each text in the training, development and test sets:

## Vectorise documents

Next, write a function `vectoriser` to obtain Bag-of-ngram representations for a list of documents. The function should take as input:
- `X_ngram`: a list of texts (documents), where each text (doc) is represented as list of n-grams in the `vocab`
- `vocab`: a set of n-grams to be used for representing the documents

and return:
- `X_vec`: an array with dimensionality N x |vocab| where N is the number of documents and |vocab| is the size of the vocabulary. Each element of the array should represent the frequency of a given n-gram in a document.


In [None]:
#Take training set as input
#divide into doc
#extract each ngram for each doc
#Count ngram in doc using counter
#iterate counter through the vocab and assign counts in that order
def Extract_X_ngram(data_tr_textlist):
    X_ngram =[]
    for line in data_tr_textlist:
        ngram,p = list(extract_ngrams(str(line),stop_words= stop_words))
        X_ngram.append(ngram)
    return X_ngram

X_ngram = Extract_X_ngram(data_tr_textlist)
#print(Counter(X_ngram[0]))

In [None]:
def vectorise(X_ngram, vocab):
    # Map each n-gram to a column index; iterating `vocab` gives the same order as list(vocab)
    ngram_to_col = {ngram: col for col, ngram in enumerate(vocab)}
    # N x |vocab|, e.g. 7200 x 5000 for the sentiment training set
    X_vec = np.zeros((len(X_ngram), len(ngram_to_col)))

    for i, row in enumerate(X_ngram):
        # Count each n-gram in the document, then write the counts of in-vocabulary n-grams
        for ngram, count in Counter(row).items():
            col = ngram_to_col.get(ngram)
            if col is not None:
                X_vec[i, col] = count
    return X_vec

Finally, use `vectorise` to obtain document vectors for each document in the train, development and test set. You should extract both count and tf.idf vectors respectively:

#### Count vectors

In [None]:
X_tr_count = vectorise(X_ngram,vocab)
print(X_tr_count)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [None]:
X_tr_count.shape #(7200, 5000)
type(X_tr_count)

numpy.ndarray

In [None]:
X_tr_count[:2,:50]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0.]])
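
The text above also asks for tf.idf vectors, which this notebook does not compute. A minimal sketch of one way to derive them from a count matrix is shown below. The function name `tfidf_from_counts` and the idf variant $\log_{10}(N / df)$ are assumptions made for illustration, not something fixed by the assignment text.

```python
import numpy as np

def tfidf_from_counts(X_count, idf=None):
    # X_count: N x |vocab| matrix of raw n-gram counts
    X_count = np.asarray(X_count, dtype=float)
    if idf is None:
        n_docs = X_count.shape[0]
        # Number of documents in which each n-gram occurs at least once
        doc_freq = np.count_nonzero(X_count, axis=0)
        # Assumed idf variant: log10(N / df); maximum(..., 1) avoids division by zero
        idf = np.log10(n_docs / np.maximum(doc_freq, 1))
    # Weight every count by the idf of its column
    return X_count * idf, idf

X_demo = np.array([[1, 0],
                   [2, 3]])
X_demo_tfidf, idf_demo = tfidf_from_counts(X_demo)
print(X_demo_tfidf)
```

The idf returned for the training matrix can be passed back in through the `idf` argument when transforming the development and test matrices, so that all three sets share the same weighting.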

# Binary Logistic Regression

After obtaining vector representations of the data, now you are ready to implement Binary Logistic Regression for classifying sentiment.

First, you need to implement the `sigmoid` function. It takes as input:

- `z`: a real number or an array of real numbers

and returns:

- `sig`: the sigmoid of `z`

Then, implement the `predict_proba` function to obtain prediction probabilities. It takes as input:

- `X`: an array of inputs, i.e. documents represented by bag-of-ngram vectors ($N \times |vocab|$)
- `weights`: a 1-D array of the model's weights $(1, |vocab|)$

and returns:

- `preds_proba`: the prediction probabilities of X given the weights

Then, implement the `predict_class` function to obtain the most probable class for each vector in an array of input vectors. It takes as input:

- `X`: an array of documents represented by bag-of-ngram vectors ($N \times |vocab|$)
- `weights`: a 1-D array of the model's weights $(1, |vocab|)$

and returns:

- `preds_class`: the predicted class for each x in X given the weights
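
The three functions described above could be sketched as follows. This is one possible implementation, not the notebook's own: the clipping inside `sigmoid` and the `threshold` argument of `predict_class` are assumptions added for numerical safety and illustration.

```python
import numpy as np

def sigmoid(z):
    # Logistic function; clipping avoids overflow in exp for very large |z|
    z = np.clip(np.asarray(z, dtype=float), -500, 500)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, weights):
    # P(y=1 | x) for every row of X; ravel lets a (1, |vocab|) weight array work too
    return sigmoid(X @ np.ravel(weights))

def predict_class(X, weights, threshold=0.5):
    # Class 1 when the predicted probability reaches the threshold, otherwise class 0
    return (predict_proba(X, weights) >= threshold).astype(int)

X_demo = np.array([[1.0, 0.0],
                   [0.0, 2.0]])
w_demo = np.array([2.0, -1.0])
print(predict_proba(X_demo, w_demo))
print(predict_class(X_demo, w_demo))
```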

To learn the weights from data, we need to minimise the binary cross-entropy loss. Implement `binary_loss` that takes as input:

- `X`: input vectors
- `Y`: labels
- `weights`: model weights
- `alpha`: regularisation strength

and return:

- `l`: the loss score
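
A minimal sketch of such a loss is given below. It assumes that the regularisation term is an L2 penalty of the form `alpha * sum(w**2)`, which the description above does not state explicitly, and the default value of `alpha` is arbitrary.

```python
import numpy as np

def binary_loss(X, Y, weights, alpha=0.00001):
    w = np.ravel(weights)
    y = np.ravel(Y)
    # Predicted P(y=1 | x); the sigmoid is written inline to keep the sketch self-contained
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    p = np.clip(p, 1e-12, 1 - 1e-12)   # guard against log(0)
    # Average binary cross-entropy over all examples
    cross_entropy = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Assumed L2 penalty on the weights
    return cross_entropy + alpha * np.sum(w ** 2)
```

With all weights at zero every prediction is 0.5, so the loss equals $\ln 2 \approx 0.693$ regardless of the labels, which makes a convenient sanity check.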

Now, you can implement Stochastic Gradient Descent to learn the weights of your sentiment classifier. The `SGD` function takes as input:

- `X_tr`: array of training data (vectors)
- `Y_tr`: labels of `X_tr`
- `X_dev`: array of development (i.e. validation) data (vectors)
- `Y_dev`: labels of `X_dev`
- `lr`: learning rate
- `alpha`: regularisation strength
- `epochs`: number of full passes over the training data
- `tolerance`: stop training if the difference between the current and previous validation loss is smaller than a threshold
- `print_progress`: flag for printing the training progress (train/validation loss)


and returns:

- `weights`: the weights learned
- `training_loss_history`: an array with the average losses of the whole training set after each epoch
- `validation_loss_history`: an array with the average losses of the whole development set after each epoch

## Train and Evaluate Logistic Regression with Count vectors

First train the model. Here scikit-learn's `LogisticRegression` is fitted on the count vectors in place of a hand-written `SGD`:

In [None]:
##Calculate X_dev_count & X_test_count
XDev_ngram = Extract_X_ngram(data_dev_textlist)
X_dev_count = vectorise(XDev_ngram,vocab)

XTest_ngram = Extract_X_ngram(data_test_textlist)
X_test_count = vectorise(XTest_ngram,vocab)

In [None]:
# Train Logistic Regression with sklearn
clf = LogisticRegression(
    # C=0.05,
    # max_iter=1000,          # maximum number of iterations
    # solver="lbfgs",         # a commonly used solver (optimiser)
    # multi_class="auto"      # auto: works for both binary and multinomial
)
clf.fit(X_tr_count, ArrayTR.ravel())  # train

# Predict on the validation/dev set
y_pred_dev = clf.predict(X_dev_count)

print("Classification Report (Dev set):")
print(classification_report(ArrayDev, y_pred_dev))

print("Confusion Matrix (Dev set):")
print(confusion_matrix(ArrayDev, y_pred_dev))


Classification Report (Dev set):
              precision    recall  f1-score   support

           0       0.72      0.69      0.71       400
           1       0.70      0.73      0.72       400

    accuracy                           0.71       800
   macro avg       0.71      0.71      0.71       800
weighted avg       0.71      0.71      0.71       800

Confusion Matrix (Dev set):
[[277 123]
 [108 292]]
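
As a quick check, the figures in the classification report can be recomputed from the confusion matrix above. The sketch assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix from the dev set: rows are true classes, columns are predicted classes
cm = [[277, 123],
      [108, 292]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

metrics = {}
for c in (0, 1):
    predicted_c = cm[0][c] + cm[1][c]      # everything predicted as class c
    precision = cm[c][c] / predicted_c
    recall = cm[c][c] / sum(cm[c])         # correct predictions over all true class-c examples
    f1 = 2 * precision * recall / (precision + recall)
    metrics[c] = (precision, recall, f1)
    print(f"class {c}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

print(f"accuracy={accuracy:.2f}")
```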


In [None]:
# 1. New input sentence
new_sentence = ["best tweet ever!"]

# 2. Convert the new sentence to n-grams, as done for training
new_sentence_ngram = Extract_X_ngram(new_sentence)

# 3. Vectorise with the hand-written function
X_new = vectorise(new_sentence_ngram, list_of_vocab)

# 4. Predict the label
y_pred_new = clf.predict(X_new)

print("🔮 Predicted label for sentence:", new_sentence[0])
print("👉 Result:", y_pred_new[0])


🔮 Predicted label for sentence: best tweet ever!
👉 Result: 1


# Multi-class Logistic Regression

Now you need to train a Multiclass Logistic Regression (MLR) Classifier by extending the Binary model you developed above. You will use the MLR model to perform topic classification on a subset of the 20 Newsgroups dataset consisting of five classes:

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# ======================
# 1. Select categories
# ======================
categories = ['sci.med','rec.sport.hockey', 'sci.space', 'talk.religion.misc', 'comp.graphics']

newsgroups_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    remove=('headers','footers','quotes')
)
newsgroups_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    remove=('headers','footers','quotes')
)

# Merge train + test so that a subset of 4000 documents can be sampled
all_texts = newsgroups_train.data + newsgroups_test.data
all_labels = list(newsgroups_train.target) + list(newsgroups_test.target)

# ======================
# 2. Randomly sample 4000 documents
# ======================
subset_texts, _, subset_labels, _ = train_test_split(
    all_texts,
    all_labels,
    train_size=4000,
    stratify=all_labels,
    random_state=42
)

# ======================
# 3. Split into train/dev/test (70/15/15)
# ======================
X_train, X_temp, y_train, y_temp = train_test_split(
    subset_texts, subset_labels,
    test_size=0.3, random_state=42, stratify=subset_labels
)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.5, random_state=42, stratify=y_temp
)

# ======================
# 4. Build DataFrames as in the original code
# ======================
Topic_data_tr = pd.DataFrame({"label": y_train, "text": X_train})
Topic_data_dev = pd.DataFrame({"label": y_dev, "text": X_dev})
Topic_data_test = pd.DataFrame({"label": y_test, "text": X_test})

# ======================
# 5. Also build lists + numpy arrays (old format)
# ======================
topic_tr_textlist = Topic_data_tr[["text"]].values.tolist()
ArrayTR_Topic = Topic_data_tr[["label"]].to_numpy()

topic_dev_textlist = Topic_data_dev[["text"]].values.tolist()
ArrayDev_Topic = Topic_data_dev[["label"]].to_numpy()

topic_test_textlist = Topic_data_test[["text"]].values.tolist()
ArrayTest_Topic = Topic_data_test[["label"]].to_numpy()

# ======================
# 6. Check
# ======================
print("Train size:", len(Topic_data_tr))
print("Dev size:", len(Topic_data_dev))
print("Test size:", len(Topic_data_test))
print("Number of classes:", len(np.unique(ArrayTR_Topic)))

Topic_data_tr.head()


Train size: 2800
Dev size: 600
Test size: 600
Number of classes: 5


|   | label | text |
|---|-------|------|
| 0 | 3 | At one time there was speculation that the fir... |
| 1 | 0 | I am looking for some fast polygon routines (S... |
| 2 | 1 | Can some on e give me some stats on Forsrg in ... |
| 3 | 3 | Regarding the feasability of retrieving the HS... |
| 4 | 1 | -=> Quoting Greg Rogers to All <=-\n GR> Hi al... |


In [None]:
print(len(topic_test_textlist))

600


In [None]:
topic_vocab, topic_df, topic_ngram_counts_tr = get_vocab(topic_tr_textlist, ngram_range=(1,3), keep_topN=5000, stop_words=stop_words)
#topic_vocab, topic_df_dev, topic_ngram_counts_dev = get_vocab(topic_dev_textlist, ngram_range=(1,3), keep_topN=5000, stop_words=stop_words)
#topic_vocab,topic_df_test,topic_ngram_counts_test = get_vocab(topic_test_textlist, ngram_range=(1,3), keep_topN=5000, stop_words=stop_words)

In [None]:
list_of_vocab = list(topic_vocab)
vocab_id_topic = {i:list_of_vocab[i] for i in range(len(topic_vocab))}

In [None]:
print(len(topic_vocab))
print(list(topic_vocab)[:100])
print()
print(topic_df.most_common()[:10])
#print(list_of_vocab[:10])

5000
['quiet', 'diagnosis', 'HELP', 'unc', 'newly', 'even though', 'Suite', 'He', 'nGeorge', 'circle', 'bother', 'meet', 've heard', 'unusual', 'OF', 'nChris', 'okay', 'nI agree', 'craft', 'nDetroit', 'Analysis', 'controlled', 'Centre', 'NYR', 'Young', 'studied', 'massive', 'shameful surrender too', 'resolution', 'combination', 'capabilities', 'npoint', 'wing', 'countries', 'requests', 'difference between', 'library', 'Bay', 'justify', 'hours', 'forever', 'one another', 'involves', 'format', 'engineers', 'single', 'nThanks', 'ON', 'non', 'centers', 'nBy', 'Canadiens', 'acts', 'schools', 'nalso', 'ESPN', 'landing', 'Central', 'last night', 'cult', 'occurred', 'all sorts', 'might', 'farm', 'York', 'magazine', 'effective', 'classic', 'guidelines', 'launched', 'edu au', 'wondering', 'funding', 'bad', 'function', 'delivery', 'America', 'impact', 'powerful', 'heat', 'prevention', 'Hope', 'arbitrary', 'offering', 'heart', 'cl msu', 'nmaybe', 'voice', 'attached', 'heavily', 'would say', 'versu

In [None]:
##Counts

topic_X_tr_ngram = Extract_X_ngram(topic_tr_textlist)
topic_X_tr_count = vectorise(topic_X_tr_ngram,topic_vocab)

topic_XDev_ngram = Extract_X_ngram(topic_dev_textlist)
topic_X_dev_count = vectorise(topic_XDev_ngram,topic_vocab)

topic_XTest_ngram = Extract_X_ngram(topic_test_textlist)
topic_X_test_count = vectorise(topic_XTest_ngram,topic_vocab)

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, confusion_matrix

# ======================
# 1. Train with count vectors
# ======================
sgd_count = SGDClassifier(
    loss="log_loss",      # logistic regression
    penalty="l2",         # regularization
    max_iter=1000,
    random_state=42
)
sgd_count.fit(topic_X_tr_count, ArrayTR_Topic.ravel())

# Evaluate on the dev set
y_dev_pred_count = sgd_count.predict(topic_X_dev_count)
print("=== Logistic Regression (Count Vectors) ===")
print(classification_report(ArrayDev_Topic, y_dev_pred_count))
print(confusion_matrix(ArrayDev_Topic, y_dev_pred_count))


=== Logistic Regression (Count Vectors) ===
              precision    recall  f1-score   support

           0       0.84      0.87      0.85       127
           1       0.88      0.85      0.87       131
           2       0.78      0.85      0.81       130
           3       0.84      0.75      0.79       129
           4       0.74      0.76      0.75        83

    accuracy                           0.82       600
   macro avg       0.82      0.82      0.81       600
weighted avg       0.82      0.82      0.82       600

[[110   3  10   2   2]
 [  1 112   6   5   7]
 [  9   0 110   6   5]
 [  9   4  11  97   8]
 [  2   8   4   6  63]]


In [None]:
# 1. New input
new_sentence = ["Astronomers discovered a new exoplanet orbiting a distant star, raising hopes for extraterrestrial life"]

# 2. Extract n-grams (same as during training)
new_sentence_ngram = Extract_X_ngram(new_sentence)

# 3. Vectorise with the existing vocab
X_new_count = vectorise(new_sentence_ngram, list_of_vocab)  # Count vector

# 4. Predict with the count model
y_pred_new_count = sgd_count.predict(X_new_count)

print("🔮 Sentence:", new_sentence[0])
print("👉 Prediction (Count):", y_pred_new_count[0])


🔮 Sentence: Astronomers discovered a new exoplanet orbiting a distant star, raising hopes for extraterrestrial life
👉 Prediction (Count): 3


Now you need to compute the categorical cross entropy loss (extending the binary loss to support multiple classes).
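
A minimal sketch of a softmax and the corresponding categorical cross-entropy is shown below. The weight layout of (number of classes) x |vocab|, the L2 penalty and the function names are assumptions made for illustration. The `softmax` here can be cross-checked against `scipy.special.softmax`, which is imported as `sftmx` at the top of this notebook.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise maximum before exponentiating for numerical stability
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def categorical_loss(X, Y, weights, alpha=0.00001):
    # weights: (num_classes, |vocab|); Y: integer class ids starting at 0
    probs = softmax(X @ weights.T)                     # N x num_classes
    y = np.ravel(Y).astype(int)
    # Probability assigned to the true class of each example, guarded against log(0)
    picked = np.clip(probs[np.arange(len(y)), y], 1e-12, 1.0)
    # Average negative log-likelihood plus an assumed L2 penalty
    return -np.mean(np.log(picked)) + alpha * np.sum(weights ** 2)
```

With all weights at zero the softmax is uniform, so for the five classes used here the loss equals $\ln 5 \approx 1.609$.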

In [None]:
# Reuse the categories used when loading the dataset
categories = ['sci.med','rec.sport.hockey', 'sci.space', 'talk.religion.misc', 'comp.graphics']

# Load the training split
newsgroups_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    remove=('headers','footers','quotes')
)

# Print the numeric classes
print("Numeric classes:", np.unique(newsgroups_train.target))

# Print the numeric class -> label mapping
for i, name in enumerate(newsgroups_train.target_names):
    print(f"{i} → {name}")

Numeric classes: [0 1 2 3 4]
0 → comp.graphics
1 → rec.sport.hockey
2 → sci.med
3 → sci.space
4 → talk.religion.misc


Graph output from the HPC run:

![NLP1.png](attachment:NLP1.png)