# Bag of Words model as a text representation

#### Bag of Words simplifies text representation by treating a text as a collection of words, ignoring grammar and the position of each word in the sentence. The text is converted to lowercase and punctuation is ignored
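
To make the "ignores word position" idea concrete, here is a minimal standard-library sketch. The two sentences and the `bag_of_words` helper are hypothetical illustrations, not part of the notebook's corpus, and the tokenization (runs of ASCII letters and digits) is a simplifying assumption.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, drop punctuation, and count each word."""
    # Assumption: a "word" is any run of ASCII letters or digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Hypothetical sentences: same words, different order and punctuation.
bag_a = bag_of_words("The cat sat on the mat.")
bag_b = bag_of_words("The mat sat on the cat!")

print(bag_a)
print(bag_a == bag_b)  # the two bags compare equal
```

Because only the word counts are kept, two sentences with different meanings can end up with an identical representation, which is the main trade-off of this model.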

In [1]:
corpus = [
    'Linux has been around since the mid-1990s.',
    'Linux distributions include the linux kernel.',
    'Linux is one of the most promient open-source software'
]

corpus

['Linux has been around since the mid-1990s.',
 'Linux distributions include the linux kernel.',
 'Linux is one of the most promient open-source software']

## Bag of Words model with CountVectorizer

### The Bag of Words model can be implemented using CountVectorizer

In [26]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# fit_transform returns a sparse matrix; todense() converts it to a dense 2-D matrix
vectorizer_X = vectorizer.fit_transform(corpus).todense()
vectorizer_X

# the 3 rows represent the 3 sentences in the corpus

matrix([[1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
        [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]],
       dtype=int64)

In [27]:
vectorizer.get_feature_names_out()

# the feature names are already lowercase

array(['1990s', 'around', 'been', 'distributions', 'has', 'include', 'is',
       'kernel', 'linux', 'mid', 'most', 'of', 'one', 'open', 'promient',
       'since', 'software', 'source', 'the'], dtype=object)
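
To see where these 19 columns come from, the sketch below rebuilds the vocabulary and the count matrix by hand, using only the standard library. It assumes CountVectorizer's default tokenization (lowercasing, then tokens of two or more word characters, i.e. the documented default pattern `(?u)\b\w\w+\b`); the helper names are hypothetical.

```python
import re

corpus = [
    'Linux has been around since the mid-1990s.',
    'Linux distributions include the linux kernel.',
    'Linux is one of the most promient open-source software'
]

# Assumption: mirrors CountVectorizer's default token pattern
# (tokens of at least two word characters, punctuation as separator).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


tokenized = [tokenize(doc) for doc in corpus]

# The vocabulary is the sorted set of all distinct tokens.
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per vocabulary term.
counts = [[doc.count(term) for term in vocabulary] for doc in tokenized]

print(vocabulary)
for row in counts:
    print(row)
```

Note that `mid-1990s` is split into `mid` and `1990s` because the hyphen is not a word character, and that `linux` has a count of 2 in the second document.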

## Euclidean Distance to measure the closeness/distance between documents (vectors)

In [None]:
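# The original cell raised an error, most likely because recent scikit-learn
# versions reject np.matrix (the type returned by todense());
# converting to a regular ndarray first avoids this.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

X = np.asarray(vectorizer_X)

# compare every pair of distinct documents once
for i in range(len(X)):
    for j in range(i + 1, len(X)):
        jarak = euclidean_distances(X[i].reshape(1, -1), X[j].reshape(1, -1))
        print(f'Distance between document {i+1} and {j+1}: {jarak[0][0]}')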
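
As a cross-check that does not depend on scikit-learn, the Euclidean distance can be computed directly from its definition, the square root of the sum of squared differences. The sketch below copies the three count vectors from the CountVectorizer output earlier in this notebook and uses `math.dist` (available from Python 3.8); the variable names are hypothetical.

```python
import math
from itertools import combinations

# Count vectors copied from the CountVectorizer output (19 features each).
docs = [
    [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1],
]

distances = {}
for i, j in combinations(range(len(docs)), 2):
    # math.dist computes sqrt(sum((a - b) ** 2)) over the two vectors.
    distances[(i + 1, j + 1)] = math.dist(docs[i], docs[j])
    print(f'Distance between document {i+1} and {j+1}: {distances[(i + 1, j + 1)]:.3f}')
```

The squared distances work out to 10, 14, and 12, so documents 1 and 2 are the closest pair (about 3.162) and documents 1 and 3 the farthest (about 3.742).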

## Stop Word Filtering on text

#### Stop word filtering simplifies text representation by ignoring certain words such as determiners (the, a, an), auxiliary verbs (do, be, will), and prepositions (on, in, at)
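
The idea can be sketched with a plain set lookup. The stop list below contains only the example words named above and is a deliberately tiny, hypothetical list; real stop-word lists, such as the one scikit-learn ships for English, are much longer.

```python
import re

# Assumption: a toy stop list built from the examples in the text above.
STOP_WORDS = {"the", "a", "an", "do", "be", "will", "on", "in", "at"}


def remove_stop_words(text):
    """Lowercase, tokenize, and drop every token found in STOP_WORDS."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


filtered = remove_stop_words('Linux distributions include the linux kernel.')
print(filtered)
```

Here only `the` is removed, leaving the content-bearing words of the sentence.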

In [30]:
corpus

['Linux has been around since the mid-1990s.',
 'Linux distributions include the linux kernel.',
 'Linux is one of the most promient open-source software']

## Stop Word Filtering with CountVectorizer

#### Stop word filtering can also be applied using CountVectorizer

In [31]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
vectorizer_X = vectorizer.fit_transform(corpus).todense()
vectorizer_X

matrix([[1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
        [0, 1, 1, 1, 2, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]], dtype=int64)

In [33]:
vectorizer.get_feature_names_out()

array(['1990s', 'distributions', 'include', 'kernel', 'linux', 'mid',
       'open', 'promient', 'software', 'source'], dtype=object)
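
To see exactly which words `stop_words='english'` removed, the two vocabularies printed in this notebook can be compared with a set difference. The lists below are copied from the `get_feature_names_out()` outputs; the variable names are hypothetical.

```python
# Vocabulary without stop word filtering (19 features).
all_features = [
    '1990s', 'around', 'been', 'distributions', 'has', 'include', 'is',
    'kernel', 'linux', 'mid', 'most', 'of', 'one', 'open', 'promient',
    'since', 'software', 'source', 'the',
]

# Vocabulary with stop_words='english' (10 features).
filtered_features = [
    '1990s', 'distributions', 'include', 'kernel', 'linux', 'mid',
    'open', 'promient', 'software', 'source',
]

# Words present before filtering but absent afterwards.
removed = sorted(set(all_features) - set(filtered_features))
print(removed)
print(len(all_features), '->', len(filtered_features))
```

Nine words are dropped, so the feature space shrinks from 19 to 10 dimensions while the content words stay in place.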