## Feature extraction from text

This notebook is divided into two sections:
* First, we'll find out what is necessary to build an NLP system that can turn a body of text into a numerical array of *features* by manually calculating frequencies and building out TF-IDF.
* Next we'll show how to perform these steps using scikit-learn tools.

# Part One: Core Concepts on Feature Extraction


In this section we'll use basic Python to build a rudimentary NLP system. We'll build a *corpus of documents* (two small text files), create a *vocabulary* from all the words in both documents, and then demonstrate a *Bag of Words* technique to extract features from each document.<br>
<div class="alert alert-info" style="margin: 20px">This first section is for illustration only!
<br>Don't worry about memorizing this code - later on we will let Scikit-Learn Preprocessing tools do this for us.</div>


## 1) Start with some documents:
For simplicity we won't use any punctuation in the text files One.txt and Two.txt. Let's quickly open and read them. Keep in mind that you should avoid reading very large files in full: `read()` loads the whole file into memory, and printing the result dumps all of it to the screen.
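For larger files, one common alternative is to iterate over the file object so that only one line is held in memory at a time. The sketch below is illustrative only: it writes a small temporary file (a hypothetical stand-in for a large one) so that it runs anywhere.

```python
import os
import tempfile

# Hypothetical stand-in for a large file: write a small temporary file
# so the example does not depend on One.txt being present.
sample = "This is a story about dogs\nour canine pets\nDogs are furry animals\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

line_count = 0
word_count = 0
with open(path) as mytext:
    for line in mytext:  # the file object yields one line at a time
        line_count += 1
        word_count += len(line.split())

os.remove(path)  # clean up the temporary file
print(line_count, word_count)
```

Because each line is processed and then discarded, memory use stays roughly constant regardless of file size.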


In [1]:
with open('../data/One.txt') as mytext:
    a = mytext.read()
    print(a)

This is a story about dogs
our canine pets
Dogs are furry animals



In [2]:
# readlines() returns a list with one string per line
with open('../data/One.txt') as mytext:
    a = mytext.readlines()
    print(a)

['This is a story about dogs\n', 'our canine pets\n', 'Dogs are furry animals\n']


### Reading the entire text as a string

In [3]:
with open('../data/Two.txt') as mytext:
    entire_text = mytext.read()

In [4]:
print(entire_text)

This story is about surfing
Catching waves is fun
Surfing is a popular water sport



### Reading each line as a list item

In [5]:
with open('../data/One.txt') as mytext:
    lines = mytext.readlines()

In [6]:
lines

['This is a story about dogs\n',
 'our canine pets\n',
 'Dogs are furry animals\n']

### Reading in words separately

In [7]:
with open('../data/One.txt') as mytext:
    words = mytext.read().lower().split()

In [8]:
words

['this',
 'is',
 'a',
 'story',
 'about',
 'dogs',
 'our',
 'canine',
 'pets',
 'dogs',
 'are',
 'furry',
 'animals']

## 2) Building a vocabulary (Creating a "Bag of Words")
Let's create dictionaries that correspond to unique mappings of the words in the documents. We can begin to think of this as mapping out all the possible words available for all (both) documents.

### Read in one text

In [9]:
with open('../data/One.txt') as mytext:
    words_one = mytext.read().lower().split()

In [10]:
words_one

['this',
 'is',
 'a',
 'story',
 'about',
 'dogs',
 'our',
 'canine',
 'pets',
 'dogs',
 'are',
 'furry',
 'animals']

In [11]:
len(words_one)

13

### 2.1) Getting the unique words only

In [12]:
unique_words_one = set(words_one)

In [13]:
unique_words_one

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'dogs',
 'furry',
 'is',
 'our',
 'pets',
 'story',
 'this'}

In [14]:
len(unique_words_one)

12

Now we have only 12 unique words instead of the original 13 words in document One, because "dogs" appears twice.

### Repeat for Two.txt

In [15]:
with open('../data/Two.txt') as mytext:
    words_two = mytext.read().lower().split()
    unique_words_two = set(words_two)

In [16]:
len(words_two), len(unique_words_two)

(15, 12)

### 2.2) Get all unique words across all documents (both One and Two)

In [17]:
all_unique_words = set()

all_unique_words.update(unique_words_one)

In [18]:
print(all_unique_words)

{'dogs', 'story', 'are', 'a', 'furry', 'our', 'animals', 'this', 'about', 'pets', 'is', 'canine'}


In [19]:
all_unique_words.update(unique_words_two)

In [20]:
print(all_unique_words)

{'dogs', 'a', 'furry', 'fun', 'our', 'surfing', 'this', 'about', 'is', 'canine', 'sport', 'story', 'are', 'popular', 'catching', 'water', 'animals', 'pets', 'waves'}


### 2.3) Create vocab dictionary with related index

In [21]:
full_vocab = {}

# Assign each unique word its own index
for i, word in enumerate(all_unique_words):
    full_vocab[word] = i

In [22]:
full_vocab

{'dogs': 0,
 'a': 1,
 'furry': 2,
 'fun': 3,
 'our': 4,
 'surfing': 5,
 'this': 6,
 'about': 7,
 'is': 8,
 'canine': 9,
 'sport': 10,
 'story': 11,
 'are': 12,
 'popular': 13,
 'catching': 14,
 'water': 15,
 'animals': 16,
 'pets': 17,
 'waves': 18}
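Note that Python sets have no guaranteed iteration order, so the indices assigned above can differ from one run to the next. A minimal sketch of one way to get reproducible indices is to sort the words first. The word lists are hard-coded here (copied from the two documents) so the block runs on its own, and the resulting indices will not match the ones printed above.

```python
# Words from One.txt and Two.txt, hard-coded so the sketch is self-contained.
words_one = "this is a story about dogs our canine pets dogs are furry animals".split()
words_two = ("this story is about surfing catching waves is fun "
             "surfing is a popular water sport").split()

# The union of both sets gives every unique word across the corpus.
all_unique_words = set(words_one) | set(words_two)

# Sorting before enumerating makes the word -> index mapping deterministic.
full_vocab = {word: index for index, word in enumerate(sorted(all_unique_words))}

print(len(full_vocab))
print(full_vocab["a"], full_vocab["waves"])
```

With alphabetical ordering, `'a'` lands at index 0 and `'waves'` at index 18, no matter how the set happens to be ordered internally.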

## 3) Bag of Words to Frequency Counts
Now that we've encapsulated our "entire language" in a dictionary, let's perform feature extraction on each of our original documents:

### Empty counts per doc

In [23]:
one_freq = [0] * len(full_vocab)
two_freq = [0] * len(full_vocab)
all_words = [''] * len(full_vocab)

In [24]:
one_freq

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [25]:
two_freq

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [26]:
all_words

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

### 3.1) Make a list of all vocabulary words (used later as column labels)

In [27]:
for word in full_vocab:
    word_index = full_vocab[word]
    all_words[word_index] = word

In [28]:
print(all_words)

['dogs', 'a', 'furry', 'fun', 'our', 'surfing', 'this', 'about', 'is', 'canine', 'sport', 'story', 'are', 'popular', 'catching', 'water', 'animals', 'pets', 'waves']


### 3.2) Add in counts per word per doc:

In [29]:
with open('../data/One.txt') as file:
    one_text = file.read().lower().split()

for word in one_text:
    word_index = full_vocab[word]  # get the index of that specific word
    one_freq[word_index] += 1

In [30]:
one_freq

[2, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0]

In [31]:
with open('../data/Two.txt') as file:
    two_text = file.read().lower().split()

for word in two_text:
    word_index = full_vocab[word]  # get the index of that specific word
    two_freq[word_index] += 1

In [32]:
two_freq

[0, 1, 0, 1, 0, 2, 1, 1, 3, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1]
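The same counts can also be produced with `collections.Counter` from the standard library, which avoids the explicit index bookkeeping. This is only an alternative sketch: the word lists are hard-coded so the block is self-contained, and the vocabulary is sorted, so the column order differs from the output above.

```python
from collections import Counter

# Words from One.txt and Two.txt, hard-coded so the sketch is self-contained.
one_text = "this is a story about dogs our canine pets dogs are furry animals".split()
two_text = ("this story is about surfing catching waves is fun "
            "surfing is a popular water sport").split()

# Sorted vocabulary gives a stable column order.
vocab = sorted(set(one_text) | set(two_text))

# Counter returns 0 for words it has never seen, so no KeyError handling is needed.
one_counts = Counter(one_text)
two_counts = Counter(two_text)

one_freq = [one_counts[word] for word in vocab]
two_freq = [two_counts[word] for word in vocab]

print(one_freq)
print(two_freq)
```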

### 3.3) Create the DataFrame:

In [33]:
import pandas as pd

In [34]:
pd.DataFrame(data=[one_freq, two_freq], columns=all_words)

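| | dogs | a | furry | fun | our | surfing | this | about | is | canine | sport | story | are | popular | catching | water | animals | pets | waves |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 3 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |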


Now we can see how frequently each word appears in each document.

By comparing the vectors we see that some words are common to both documents, some appear only in One.txt, and others only in Two.txt. Extending this logic to tens of thousands of documents, we would see the vocabulary dictionary grow to hundreds of thousands of words. Each vector would then contain mostly zero values, which makes the resulting matrix a *sparse matrix*.
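To make the idea of sparsity concrete, here is a rough sketch that stores only the non-zero counts of each vector. The helper names `to_sparse` and `to_dense` are hypothetical, chosen for this illustration; real libraries use more compact formats, but the principle of skipping zeros is the same.

```python
# Frequency vectors copied from the outputs above.
one_freq = [2, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0]
two_freq = [0, 1, 0, 1, 0, 2, 1, 1, 3, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1]


def to_sparse(vector):
    """Keep only the non-zero entries as {column_index: count}."""
    return {index: value for index, value in enumerate(vector) if value != 0}


def to_dense(sparse, length):
    """Rebuild the full vector, filling missing positions with zeros."""
    return [sparse.get(index, 0) for index in range(length)]


sparse_one = to_sparse(one_freq)
sparse_two = to_sparse(two_freq)

# Fraction of entries that are zero across both vectors.
total = len(one_freq) + len(two_freq)
stored = len(sparse_one) + len(sparse_two)
sparsity = (total - stored) / total

print(sparse_one)
print(f"stored {stored} of {total} values, sparsity = {sparsity:.2f}")
```

With only two short documents the saving is small, but the share of zeros grows quickly as the vocabulary gets larger.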

## Concepts to Consider:
## Bag of Words and Tf-idf

In the examples above, each vector can be considered a bag of words. By themselves these vectors may not be helpful until we consider term frequencies, or how often individual words appear in a document. A simple way to calculate term frequency is to divide the number of occurrences of a word by the total number of words in the document. In this way, the frequency of a word in a large document can be compared with its frequency in a small one.

However, it may be hard to differentiate documents based on term frequency if a word shows up in most of the documents. To handle this we also consider inverse document frequency, which is the total number of documents divided by the number of documents that contain the word. In practice we convert this value to a logarithmic scale.

Together these terms become tf-idf.
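The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: it uses the natural logarithm and no smoothing, and the function names are hypothetical. Libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ.

```python
import math

# Words from One.txt and Two.txt, hard-coded so the sketch is self-contained.
doc_one = "this is a story about dogs our canine pets dogs are furry animals".split()
doc_two = ("this story is about surfing catching waves is fun "
           "surfing is a popular water sport").split()
corpus = [doc_one, doc_two]


def term_frequency(word, doc_words):
    """Occurrences of the word divided by the total words in the document."""
    return doc_words.count(word) / len(doc_words)


def inverse_document_frequency(word, all_docs):
    """log(total documents / documents containing the word).

    Assumes the word occurs in at least one document.
    """
    containing = sum(1 for words in all_docs if word in words)
    return math.log(len(all_docs) / containing)


def tf_idf(word, doc_words, all_docs):
    return term_frequency(word, doc_words) * inverse_document_frequency(word, all_docs)


print(tf_idf("dogs", doc_one, corpus))  # only in One.txt, so it gets a positive weight
print(tf_idf("is", doc_two, corpus))    # in both documents, so idf = log(1) = 0
```

A word such as "is" that appears in every document ends up with a weight of zero, while "dogs", which is specific to one document, keeps a positive weight.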

## Stop Words and Word Stems
Some words like "the" and "and" appear so frequently, and in so many documents, that we needn't bother counting them. Also, it may make sense to record only the root of a word, for example `cat` in place of both `cat` and `cats`. This shrinks our vocabulary array and improves performance.
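As a rough illustration of both ideas, the sketch below filters a tiny hand-written stop word list and strips a trailing "s" as a stand-in for stemming. Both `STOP_WORDS` and `naive_stem` are hypothetical and deliberately simplistic; real stop word lists are much longer, and real stemmers handle far more cases than plural endings.

```python
# Hypothetical, hand-picked stop words for this example only.
STOP_WORDS = {"a", "about", "are", "is", "our", "this"}


def naive_stem(word):
    """Very crude stemmer: drop a trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


words = "this is a story about dogs our canine pets dogs are furry animals".split()

# Drop stop words first, then reduce the remaining words to their crude stems.
filtered = [naive_stem(word) for word in words if word not in STOP_WORDS]

print(filtered)
print(len(set(words)), "unique words before,", len(set(filtered)), "after")
```

Here the vocabulary for One.txt shrinks from 12 unique words to 6.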

## Tokenization and Tagging

When we created our vectors, the first thing we did was split the incoming text on whitespace with `.split()`. This was a crude form of tokenization, that is, dividing a document into individual words. In this simple example we didn't worry about punctuation or different parts of speech. In the real world we rely on fairly sophisticated morphology to parse text appropriately.
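To see why whitespace splitting is crude, compare it with a slightly more careful regular-expression tokenizer on a sentence that contains punctuation. The sentence and both function names are made up for this sketch, and the regex is still far from a real tokenizer.

```python
import re


def crude_tokenize(text):
    """Split on whitespace only, as done earlier in this notebook."""
    return text.lower().split()


def regex_tokenize(text):
    """Keep runs of letters, digits, and apostrophes; drop other punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


sentence = "Dogs are furry animals. Dogs, our canine pets, are fun!"

crude = crude_tokenize(sentence)
regex = regex_tokenize(sentence)

print(crude)  # punctuation stays attached, e.g. 'animals.' and 'dogs,'
print(regex)  # punctuation removed, so both 'dogs' tokens now match
```

With whitespace splitting, `'dogs'` and `'dogs,'` would be counted as two different vocabulary entries.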

Once the text is divided, we can go back and tag our tokens with information about parts of speech, grammatical dependencies, etc. This adds more dimensions to our data and enables a deeper understanding of the context of specific documents. For this reason, vectors become **high dimensional sparse matrices**.

---
# Part Two: Feature Extraction with Scikit-Learn
Let's explore the more realistic process of using sklearn to complete the tasks mentioned above!

## Scikit-Learn's Text Feature Extraction Options

In [35]:
text = ['This is a line',
        'This is another line',
        'Completely different line']

## Feature Extraction

In [42]:
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer, CountVectorizer

## CountVectorizer

* `cv` treats each item in the list as one document.
* `fit_transform` first learns the unique vocabulary (fit) and then transforms each document in the list into its frequency counts.
* It returns a `sparse matrix`. The reason is that when you vectorize text into a bag-of-words model, most entries in the matrix are zero. When you deal with hundreds or thousands of documents containing many words, you don't want to waste memory storing a bunch of zeros. That is why we use a sparse matrix.
* The result is a 3x6 `sparse matrix`. Why 3? Because there are 3 documents in the list we passed in (and 6 because the vocabulary has 6 unique words). Calling `todense()` shows the stored frequency counts in their original dense form rather than the memory-efficient sparse representation. **NOTE: avoid calling this method on a large vocabulary, since the dense matrix would take up a lot of memory.**

In [43]:
cv = CountVectorizer()

In [44]:
sparse_matrix = cv.fit_transform(text)

In [45]:
sparse_matrix

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

### Use todense() to see original form

In [46]:
sparse_matrix.todense()

matrix([[0, 0, 0, 1, 1, 1],
        [1, 0, 0, 1, 1, 1],
        [0, 1, 1, 0, 1, 0]], dtype=int64)

### Vocabulary_

In [47]:
cv.vocabulary_

{'this': 5, 'is': 3, 'line': 4, 'another': 0, 'completely': 1, 'different': 2}

If we look closely at the values, 'another' is at index 0. Looking at the `todense()` result, column 0 has a value of 1 only in the second document, which makes sense because "This is another line" is the second document and the only one containing the word "another".
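The vocabulary and counts above can be reproduced in plain Python, which also shows why the word "a" is missing. The sketch below mimics `CountVectorizer`'s default behaviour as I understand it (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); it is a simplification for illustration, not the library's implementation.

```python
import re

text = ["This is a line",
        "This is another line",
        "Completely different line"]

# Tokens must be at least two word characters long, so 'a' is dropped.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in text]

# Sorted unique tokens -> column index.
unique_tokens = sorted({token for doc in tokenized for token in doc})
vocabulary = {token: index for index, token in enumerate(unique_tokens)}

# One row per document, one column per vocabulary word.
counts = [[0] * len(vocabulary) for _ in text]
for row, doc in zip(counts, tokenized):
    for token in doc:
        row[vocabulary[token]] += 1

print(vocabulary)
for row in counts:
    print(row)
```

The rows match the dense matrix shown earlier, with `'another'` in column 0 and `'this'` in column 5.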

## stop_words parameter
With this parameter set, common English stop words are no longer part of the vocabulary.

In [49]:
cv = CountVectorizer(stop_words='english')

In [50]:
sparse_matrix = cv.fit_transform(text)

In [51]:
cv.vocabulary_

{'line': 2, 'completely': 0, 'different': 1}

-----
## TfidfTransformer

In [60]:
tfidf_transform = TfidfTransformer()

In [61]:
cv = CountVectorizer()

In [62]:
counts = cv.fit_transform(text)

In [63]:
counts

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [64]:
tfidf = tfidf_transform.fit_transform(counts)

In [65]:
tfidf.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])
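These numbers can be checked by hand. The sketch below assumes `TfidfTransformer`'s default settings, namely a smoothed idf of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, and recomputes the matrix from the raw counts using only the standard library.

```python
import math

# Raw counts from CountVectorizer (columns: another, completely,
# different, is, line, this).
counts = [[0, 0, 0, 1, 1, 1],
          [1, 0, 0, 1, 1, 1],
          [0, 1, 1, 0, 1, 0]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents in which each term appears.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed default: smooth_idf=True).
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf = []
for row in counts:
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))  # L2 norm
    tfidf.append([value / norm for value in weighted])

for row in tfidf:
    print([round(value, 8) for value in row])
```

The word "line" appears in all three documents, so it receives the lowest idf and the smallest weight in every row.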

## Use Pipeline

In [66]:
from sklearn.pipeline import Pipeline

In [67]:
pipe = Pipeline([('cv', CountVectorizer()), ('tfidf',TfidfTransformer())])

In [69]:
result = pipe.fit_transform(text)

In [71]:
result

<3x6 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [72]:
result.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])

## TfidfVectorizer
Does both of the steps above in a single step!

In [73]:
tfidf = TfidfVectorizer()

In [74]:
new = tfidf.fit_transform(text)

In [76]:
new.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])