
## Scikit Practice
Adapted from https://colab.research.google.com/github/RPI-DATA/course-intro-ml-app/blob/master/content/notebooks/16-intro-nlp/03-scikit-learn-text.ipynb

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Methods - Text Feature Extraction with Bag-of-Words Using Scikit Learn


In [2]:
corpus = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet'])


In [3]:
len(corpus)

3

## Raw Term Frequency

###  Import Count Vectorizer

- Using the CountVectorizer from scikit-learn, we can construct a bag-of-words model with the term frequencies

- Please read the documentation for more details: [Scikit-learn doc CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)



In [4]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
# tokenize and build vocab
tf = cv.fit_transform(corpus).toarray()
tf

array([[0, 1, 1, 1, 0, 1, 0],
       [0, 1, 0, 0, 1, 1, 1],
       [1, 2, 1, 1, 1, 2, 1]])

In [5]:
cv.vocabulary_

{'the': 5, 'sun': 3, 'is': 1, 'shining': 2, 'weather': 6, 'sweet': 4, 'and': 0}

Based on the vocabulary, the word _and_ is the 1st **feature** in each document vector in tf (index zero).
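
As a small sketch of how the vocabulary ties tokens to columns, the snippet below rebuilds the same count matrix and looks up the column for the word "the". The names `col` and `the_counts` are illustrative only and do not appear in the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The sun is shining',
          'The weather is sweet',
          'The sun is shining and the weather is sweet']

cv = CountVectorizer()
tf = cv.fit_transform(corpus).toarray()

# vocabulary_ maps each token to its column index in the count matrix
col = cv.vocabulary_['the']

# Slice that column to get the count of "the" in every document
the_counts = tf[:, col]
print(col, the_counts)  # 5 [1 1 2]
```

The third document contains "the" twice ("The" is lowercased by default), which is why the last entry is 2.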

#### Shape

In [6]:
tf.shape

(3, 7)

We have 3 samples (sentences) and 7 features, one for each unique token in the vocabulary.

In [7]:
cv.get_feature_names_out()

array(['and', 'is', 'shining', 'sun', 'sweet', 'the', 'weather'],
      dtype=object)

In [8]:
cv.inverse_transform(tf)

[array(['is', 'shining', 'sun', 'the'], dtype='<U7'),
 array(['is', 'sweet', 'the', 'weather'], dtype='<U7'),
 array(['and', 'is', 'shining', 'sun', 'sweet', 'the', 'weather'],
       dtype='<U7')]

# tf-idf


- tf-idf rescales term counts so that words common to many documents receive less weight.
- With `use_idf=False`, the _TfidfVectorizer_ only normalizes the raw term frequencies (here with the L2 norm); `smooth_idf` has no effect in that case.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(use_idf=False, norm='l2', smooth_idf=False)
tf_norm = tfidf.fit_transform(corpus).toarray()
np.set_printoptions(precision=2)
print(f'Normalized term frequencies of document 3: \n  {tf_norm[-1]}')

Normalized term frequencies of document 3: 
  [0.28 0.55 0.28 0.28 0.28 0.55 0.28]
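
To make the normalization step concrete, here is a minimal sketch that reproduces it by hand: each row of raw counts is divided by its Euclidean (L2) length. The names `counts`, `manual`, and `sk` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['The sun is shining',
          'The weather is sweet',
          'The sun is shining and the weather is sweet']

# Raw term counts, as floats so the division below is not truncated
counts = CountVectorizer().fit_transform(corpus).toarray().astype(float)

# Divide every row by its L2 norm so each document vector has length 1
manual = counts / np.linalg.norm(counts, axis=1, keepdims=True)

# The same result from TfidfVectorizer with idf weighting switched off
sk = TfidfVectorizer(use_idf=False, norm='l2').fit_transform(corpus).toarray()

print(np.allclose(manual, sk))  # True
```

For document 3 the counts are `[1, 2, 1, 1, 1, 2, 1]`, whose squared sum is 13, so a count of 1 becomes 1/√13 ≈ 0.28 and a count of 2 becomes 2/√13 ≈ 0.55.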


## Term frequency-inverse document frequency -- tf-idf

In [10]:
tfidf = TfidfVectorizer(use_idf=True, smooth_idf=True, norm='l2')
tf_idf = tfidf.fit_transform(corpus).toarray()
print(f'tf-idf weights of document 3:\n {tf_idf[-1]}')


tf-idf weights of document 3:
 [0.4  0.48 0.31 0.31 0.31 0.48 0.31]
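
The idf weighting can also be reproduced by hand. The sketch below assumes the smoothed formula given in the scikit-learn documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['The sun is shining',
          'The weather is sweet',
          'The sun is shining and the weather is sweet']

counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term occurs at least once
df = (counts > 0).sum(axis=0)

# Smoothed idf, as documented for smooth_idf=True
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by idf, then L2-normalize each document vector
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sk = TfidfVectorizer(use_idf=True, smooth_idf=True, norm='l2').fit_transform(corpus).toarray()
print(np.allclose(manual, sk))  # True
```

Terms that occur in all three documents ("is", "the") get the minimum idf of 1, while "and", which occurs in only one document, gets the largest idf.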


# Bigrams and N-Grams



In [11]:
# look at sequences of tokens of minimum length 2 and maximum length 2
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit(corpus)

In [12]:
bigram_vectorizer.get_feature_names_out()

array(['and the', 'is shining', 'is sweet', 'shining and', 'sun is',
       'the sun', 'the weather', 'weather is'], dtype=object)

In [13]:
bigram_vectorizer.transform(corpus).toarray()

array([[0, 1, 0, 0, 1, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1]])

Often we want to include unigrams (single tokens) AND bigrams, which we can do by passing the following tuple to the `ngram_range` parameter of `CountVectorizer`:

In [14]:
gram_vectorizer = CountVectorizer(ngram_range=(1, 2))
gram_vectorizer.fit(corpus)

In [15]:
gram_vectorizer.get_feature_names_out()

array(['and', 'and the', 'is', 'is shining', 'is sweet', 'shining',
       'shining and', 'sun', 'sun is', 'sweet', 'the', 'the sun',
       'the weather', 'weather', 'weather is'], dtype=object)

In [16]:
gram_vectorizer.transform(corpus).toarray()

array([[0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0],
       [0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
       [1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1]])

# Character n-grams

Sometimes it helps to look at sequences of characters rather than whole words.
That is particularly useful if we have very noisy data and want to identify the language, or if we want to predict something about a single word.
We can switch from words to characters by setting ``analyzer="char"``.
Single characters are usually not very informative, but longer character n-grams can be:

In [17]:
corpus

array(['The sun is shining', 'The weather is sweet',
       'The sun is shining and the weather is sweet'], dtype='<U43')

In [19]:
char_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char")
char_vectorizer.fit(corpus)

In [20]:
print(char_vectorizer.get_feature_names_out())

[' a' ' i' ' s' ' t' ' w' 'an' 'at' 'd ' 'e ' 'ea' 'ee' 'er' 'et' 'g '
 'he' 'hi' 'in' 'is' 'n ' 'nd' 'ng' 'ni' 'r ' 's ' 'sh' 'su' 'sw' 'th'
 'un' 'we']
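
To illustrate why character n-grams can help with noisy text, here is a minimal sketch using a hypothetical misspelling, "shinning", which is not in the original notebook. A word-level vectorizer fitted on the corpus has never seen that token, while the character-bigram vectorizer still recognizes most of its pieces.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The sun is shining',
          'The weather is sweet',
          'The sun is shining and the weather is sweet']

word_vec = CountVectorizer().fit(corpus)
char_vec = CountVectorizer(ngram_range=(2, 2), analyzer="char").fit(corpus)

clean, noisy = "shining", "shinning"  # hypothetical misspelling for illustration

# The misspelled word is out of vocabulary for the word-level model
word_hits = word_vec.transform([noisy]).toarray().sum()

# Character bigrams: the only new bigram, 'nn', is unknown and simply ignored
char_clean = char_vec.transform([clean]).toarray()
char_noisy = char_vec.transform([noisy]).toarray()
same = np.array_equal(char_clean, char_noisy)

print(word_hits, same)  # 0 True
```

In this small example the two character-bigram vectors are identical because the extra bigram was never seen during fitting; on a larger corpus the vectors would usually be similar rather than equal.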
