## Download and extract the data
The data is in a gzipped tar archive (`.tar.gz`).

In [3]:
!wget "https://raw.githubusercontent.com/sziccardi/CSCI4521_DataRepository/refs/heads/main/20news-bydate.tar.gz"

'wget' is not recognized as an internal or external command,
operable program or batch file.


In [4]:
!tar -xf 20news-bydate.tar.gz

tar: Error opening archive: Failed to open '20news-bydate.tar.gz'
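
The `wget` and `tar` shell commands above are not available on every platform (the errors shown come from a Windows shell). A minimal, platform-independent sketch using only the Python standard library might look like the following. The helper name `fetch_and_extract` and its parameters are illustrative and not part of the original notebook.

```python
import os
import tarfile
import urllib.request

URL = ("https://raw.githubusercontent.com/sziccardi/CSCI4521_DataRepository/"
       "refs/heads/main/20news-bydate.tar.gz")

def fetch_and_extract(url, archive_name="20news-bydate.tar.gz", dest="."):
    """Download the archive if it is not already present, then unpack it into dest.

    Hypothetical helper: the name and arguments are assumptions for illustration.
    """
    if not os.path.exists(archive_name):
        urllib.request.urlretrieve(url, archive_name)
    with tarfile.open(archive_name, "r:gz") as tar:
        tar.extractall(dest)
    # Return the top-level entries of dest so the caller can check what was unpacked
    return sorted(os.listdir(dest))
```

Calling `fetch_and_extract(URL)` is expected to unpack the archive into the current directory, after which `DIR` in the later cells should point at the extracted `rec.sport.hockey` folder.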


## Using scikit-learn's CountVectorizer


In [5]:
from sklearn.feature_extraction.text import CountVectorizer

The parameter `min_df` (minimum document frequency) controls how rarely used words are handled.
 - If it is an integer, all words that appear in fewer than that many documents are dropped.
 - If it is a fraction, all words that appear in less than that fraction of the documents are dropped.

`max_df` works in a similar manner, dropping words that appear in too many documents.

In [6]:
vectorizer = CountVectorizer(min_df=1) #min_df=1 --> use all words

In [7]:
CountVectorizer?

Init signature:
CountVectorizer(
    *,
    input='content',
    encoding='utf-8',
    decode_error='strict',
    strip_accents=None,
    lowercase=True,
    preprocessor=None,
    tokenizer=None,
    stop_words=None,
    token_pattern='(?u)\\b\\w\\w+\\b',
    ngram_range=(1, 1),
    analyzer='word',

Consider two sentences:

In [8]:
content = ["How to catch pokemon", "Which Pokemon is the hardest to catch?"]

How many unique words are there across the two sentences?
  - Are `catch` and `catch?` the same word?
  - Are `Pokemon` and `pokemon` the same word?
  - Would `catch` and `catching` be the same word?

In [9]:
# TODO: fit_transform the sentences then print the vocab
X = vectorizer.fit_transform(content)
vectorizer.get_feature_names_out()

array(['catch', 'hardest', 'how', 'is', 'pokemon', 'the', 'to', 'which'],
      dtype=object)
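
The vocabulary above has a single `catch` and a single `pokemon`, which answers the first two questions. One way to see why is to call the vectorizer's analyzer directly, as in this small sketch: with the default settings it lowercases the text and keeps only tokens of two or more word characters, so punctuation and single letters disappear, while `catching` stays a separate token.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer returns the function CountVectorizer uses to split text into tokens
analyze = CountVectorizer().build_analyzer()

print(analyze("Which Pokemon is the hardest to catch?"))
print(analyze("catching"))
print(analyze("a b cd"))  # one-letter tokens are dropped by the default token_pattern
```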

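We can turn each sentence into a "bag of words": one entry per vocabulary word, holding the number of times that word occurs. Here no word repeats within a sentence, so for each sentence:
 - 1 means the word is present
 - 0 means the word is absent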

In [10]:
print(X.toarray())

[[1 0 1 0 1 0 1 0]
 [1 1 0 1 1 1 1 1]]


### CountVectorizer on Usenet posts

In [None]:
import os
# Colab path; change this to wherever the archive was extracted on your machine
DIR = "/content/20news-bydate-train/rec.sport.hockey"

In [12]:
posts = [open(os.path.join(DIR, filename)).read() for filename in os.listdir(DIR)]

FileNotFoundError: [WinError 3] The system cannot find the path specified: '/content/20news-bydate-train/rec.sport.hockey'

In [13]:
posts[45]

NameError: name 'posts' is not defined

In [None]:
# TODO: fit_transform the vectorizer with our new data
X_train = vectorizer.fit_transform(posts)

In [None]:
X_train.shape

(600, 12914)

In [None]:
print(vectorizer.get_feature_names_out())

['00' '000' '000256' ... 'zupancic' 'zurich' 'zzzzzz']


In [None]:
# TODO: vectorize the sentence "Should a team be added to Wisconsin?"
new_post = "Should a team be added to Wisconsin?"
new_post_vec = vectorizer.transform([new_post])
print(new_post_vec)

  (0, 1900)	1
  (0, 2610)	1
  (0, 10730)	1
  (0, 11577)	1
  (0, 11807)	1
  (0, 12665)	1
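
Each line of the sparse output above reads `(row, column)` followed by the count, where the column is an index into the vocabulary. To see which words those indices stand for, `inverse_transform` can be used. The sketch below runs on a tiny made-up corpus (`toy_posts` is hypothetical), which also shows that `transform` silently drops words it never saw during fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the hockey posts
toy_posts = ["the team won the game", "a new team joined the league"]
toy_vectorizer = CountVectorizer(min_df=1)
toy_vectorizer.fit(toy_posts)

query_vec = toy_vectorizer.transform(["Should a team be added to Wisconsin?"])

# inverse_transform maps the nonzero column indices back to vocabulary words
matched_words = toy_vectorizer.inverse_transform(query_vec)[0]
print(matched_words)
```

Here only `team` survives, because it is the only query word in the toy vocabulary.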


### Finding Nearest Neighbors

`new_post_vec` is a feature vector, and we can try to find its nearest neighbors in the training set

In [None]:
import numpy as np

In [None]:
def dist_raw(v1, v2):
  delta = v1-v2
  return np.linalg.norm(delta)

In [None]:
# TODO: find the distances between the new post and the vectors in our training set

In [None]:
# TODO: which post is the closest?
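
One possible way to fill in the two steps above, sketched on a made-up two-post corpus rather than the real data. `toy_posts`, `toy_vectorizer` and `X_toy` are hypothetical stand-ins for `posts`, `vectorizer` and `X_train`, and the corpus was chosen so that the raw distance favors a short, unrelated post.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def dist_raw(v1, v2):
    delta = v1 - v2
    return np.linalg.norm(delta)

# Hypothetical toy data: one short unrelated post, one long related post
toy_posts = [
    "nice save",
    "should the league add a team in wisconsin "
    "should the league add a team in minnesota "
    "should the league add a team anywhere",
]
toy_vectorizer = CountVectorizer(min_df=1)
X_toy = toy_vectorizer.fit_transform(toy_posts)
query_vec = toy_vectorizer.transform(["Should a team be added to Wisconsin?"])

# np.linalg.norm needs dense arrays, so convert each sparse row with toarray()
dists = [dist_raw(query_vec.toarray(), X_toy[i].toarray())
         for i in range(X_toy.shape[0])]
closest_id = int(np.argmin(dists))  # smallest distance = nearest neighbor
print(dists, toy_posts[closest_id])
```

In this toy setup the nearest neighbor is `"nice save"`, even though it shares no words with the query: the long post is penalized for every extra word count it contains.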

Hmm. The query document was `"Should a team be added to Wisconsin?"`.

Does this post seem related to our query? Let's check which elements of the feature vectors overlap.

In [None]:
# TODO: print the query vector and the closest vector?

That worked poorly... There is no overlap in features. What happened?

#### Normalized distance
Normalizing each vector to unit length before computing the distance makes the comparison depend on the mix of words in a document rather than on the document's length.

In [None]:
def dist_norm(v1, v2):
  v1_normalized = v1/np.linalg.norm(v1) #Normalize vectors to unit length
  v2_normalized = v2/np.linalg.norm(v2)
  delta = v1_normalized-v2_normalized   #Then take distance
  return np.linalg.norm(delta)

In [None]:
# TODO: find the normalized distances between the new post and the vectors in our training set then find the new closest post
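
A possible sketch of this step on made-up data, assuming a hypothetical corpus with one short unrelated post and one long related post (`toy_posts`, `toy_vectorizer` and `X_toy` stand in for `posts`, `vectorizer` and `X_train`):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def dist_norm(v1, v2):
    v1_normalized = v1 / np.linalg.norm(v1)  # scale both vectors to unit length
    v2_normalized = v2 / np.linalg.norm(v2)
    return np.linalg.norm(v1_normalized - v2_normalized)

# Hypothetical toy data: one short unrelated post, one long related post
toy_posts = [
    "nice save",
    "should the league add a team in wisconsin "
    "should the league add a team in minnesota "
    "should the league add a team anywhere",
]
toy_vectorizer = CountVectorizer(min_df=1)
X_toy = toy_vectorizer.fit_transform(toy_posts)
query_vec = toy_vectorizer.transform(["Should a team be added to Wisconsin?"])

norm_dists = [dist_norm(query_vec.toarray(), X_toy[i].toarray())
              for i in range(X_toy.shape[0])]
closest_id = int(np.argmin(norm_dists))
print(norm_dists, closest_id)
```

With normalization the long, related post becomes the nearest neighbor in this toy example. Note that a query or post with no vocabulary words at all has norm 0, so the division would produce NaN values; real code may want to guard against that case.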

## Stop Words, Stemming, and TF-IDF
Ignoring common words (stop words)

In [None]:
vectorizer = CountVectorizer(min_df=1, stop_words='english')

We'll lose some words now. The size of the feature vector should be smaller.

In [None]:
X_train.shape #Old Vectorizations

In [None]:
X_train = vectorizer.fit_transform(posts)

In [None]:
X_train.shape #New Vectorizations

In [None]:
# TODO: based on a new query post, which post in our dataset is closest?

How did this do?

We can also add stemming and tf-idf:

In [None]:
import nltk.stem
english_stemmer = nltk.stem.SnowballStemmer('english')

In [None]:
class StemmedCountVectorizer(CountVectorizer):
  def build_analyzer(self):
    analyzer = super().build_analyzer()
    return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

In [None]:
vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')
X_train = vectorizer.fit_transform(posts)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
class StemmedTfidfVectorizer(TfidfVectorizer):
  def build_analyzer(self):
    # Call the parent analyzer (lowercasing, tokenizing, stop words), then stem each token
    analyzer = super().build_analyzer()
    return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

In [None]:
vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english')
X_train = vectorizer.fit_transform(posts)

In [None]:
# TODO: with these new vectorizers, let's test that query post again

#### Cosine Similarity

We can use the cosine similarity instead of the normalized vector distance.

But remember that we now want to maximize similarity rather than minimize distance.

In [None]:
def cos_similarity(v1, v2):
  v1_n = v1/(np.linalg.norm(v1))
  v2_n = v2/(np.linalg.norm(v2))
  return np.vdot(v1_n,v2_n)

In [None]:
# TODO: use cosine similarity as a distance metric and try the query post again
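
The two measures are closely linked: for unit-length vectors, the squared normalized distance equals 2 - 2 × (cosine similarity), so the post with the highest similarity is also the one with the smallest normalized distance. The sketch below checks this on a few made-up dense vectors (`query` and `candidates` are hypothetical).

```python
import numpy as np

def cos_similarity(v1, v2):
    v1_n = v1 / np.linalg.norm(v1)
    v2_n = v2 / np.linalg.norm(v2)
    return np.vdot(v1_n, v2_n)

def dist_norm(v1, v2):
    v1_n = v1 / np.linalg.norm(v1)
    v2_n = v2 / np.linalg.norm(v2)
    return np.linalg.norm(v1_n - v2_n)

# Hypothetical feature vectors
query = np.array([1.0, 1.0, 0.0, 1.0])
candidates = [np.array([0.0, 0.0, 2.0, 0.0]),
              np.array([3.0, 3.0, 1.0, 1.0]),
              np.array([1.0, 0.0, 0.0, 0.0])]

sims = [cos_similarity(query, c) for c in candidates]
dists = [dist_norm(query, c) for c in candidates]

# Check the identity dist**2 == 2 - 2 * similarity for every candidate
print(np.allclose(np.square(dists), 2 - 2 * np.array(sims)))
# The best match is the same either way: argmax of similarity, argmin of distance
print(int(np.argmax(sims)), int(np.argmin(dists)))
```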

## Closest Document Function

In [None]:
# Helper function!
def findClosestStory(prompt):
  new_post_vec = vectorizer.transform([prompt])
  # cos_similarity needs dense arrays, so convert each sparse row first
  sims = [cos_similarity(new_post_vec.toarray(),train_vec.toarray()) for train_vec in X_train]
  closest_id = np.argmax(sims) # argmax, since higher similarity means closer
  return posts[closest_id]
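
As an alternative sketch, the same lookup can be written without global variables and without converting every row to a dense array, by using scikit-learn's `cosine_similarity`, which accepts sparse input. The function name `find_closest_post`, its parameters, and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_closest_post(prompt, vectorizer, X_train, posts):
    """Return the training post whose vector is most similar to the prompt."""
    query_vec = vectorizer.transform([prompt])
    # cosine_similarity returns a dense array of shape (1, number of posts)
    sims = cosine_similarity(query_vec, X_train)[0]
    return posts[int(np.argmax(sims))]

# Hypothetical toy data standing in for the hockey posts
toy_posts = ["nice save by the goalie",
             "the league should add a team in wisconsin"]
toy_vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
X_toy = toy_vectorizer.fit_transform(toy_posts)

closest = find_closest_post("Should a team be added to Wisconsin?",
                            toy_vectorizer, X_toy, toy_posts)
print(closest)
```

Passing the vectorizer and matrix as arguments makes it easier to compare different vectorizers on the same query.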
