The goal of this experiment is to get to know scikit-learn's CountVectorizer by applying it to a small sample of the IMDB dataset and by coding equivalent functions in Python.


Functions to implement:

1. vocab = build_vocab(corpus)
2. corpus_tok = tokenizer(corpus, vocab)
3. doc_term = feature(corpus_tok)

While debugging your program, use a very small corpus with only a few examples; once it works, run it on the 1000 examples of imdb_sample.

## Using the scikit-learn example:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import re
import numpy as np


In [None]:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
vocab = vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
print(vocab)



['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


## Showing the document-term matrix, also known as "bag of words"

In [None]:
print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
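
To make the layout of this matrix concrete, the sketch below rebuilds it in plain Python: each row is a document, each column is a vocabulary word in alphabetical order, and each cell counts how often that word occurs in that document. The helper name `doc_term_matrix` is hypothetical, and the regular expression is an assumption meant to mirror CountVectorizer's default `token_pattern` (tokens of two or more word characters).

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def doc_term_matrix(docs):
    # Lowercase and keep tokens of 2+ word characters (assumed to mirror sklearn's default)
    tokenized = [re.findall(r'\b\w\w+\b', d.lower()) for d in docs]
    # Columns: unique words in alphabetical order
    vocab = sorted({w for doc in tokenized for w in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)  # missing words count as 0
        rows.append([counts[w] for w in vocab])
    return vocab, rows

vocab_py, matrix_py = doc_term_matrix(corpus)
print(vocab_py)
for row in matrix_py:
    print(row)
```

On this four-sentence corpus the rows should match the array printed above.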


## My implementation of a simple tokenizer using the vocabulary already extracted by scikit-learn

First version: using a plain for loop




In [None]:
list_word_based = []
list_token_based = []
for amostra in corpus:
    amostra = re.sub(r'\W',' ',amostra).strip().lower()
    list_words = amostra.split(' ')
    list_tokens = []
    for word in list_words:
        list_tokens.append(vocab.index(word))
    list_word_based.append(list_words)
    list_token_based.append(list_tokens)
list_word_based, list_token_based

([['this', 'is', 'the', 'first', 'document'],
  ['this', 'document', 'is', 'the', 'second', 'document'],
  ['and', 'this', 'is', 'the', 'third', 'one'],
  ['is', 'this', 'the', 'first', 'document']],
 [[8, 3, 6, 2, 1], [8, 1, 3, 6, 5, 1], [0, 8, 3, 6, 7, 4], [3, 8, 6, 2, 1]])

Second version: for loop with a list comprehension




In [None]:
list_word_based = []
list_token_based = []
for amostra in corpus:
    amostra = re.sub(r'\W',' ',amostra).strip().lower()
    list_words = amostra.split(' ')
    list_tokens = [vocab.index(word)   for word in list_words]
    list_word_based.append(list_words)
    list_token_based.append(list_tokens)
list_word_based, list_token_based

([['this', 'is', 'the', 'first', 'document'],
  ['this', 'document', 'is', 'the', 'second', 'document'],
  ['and', 'this', 'is', 'the', 'third', 'one'],
  ['is', 'this', 'the', 'first', 'document']],
 [[8, 3, 6, 2, 1], [8, 1, 3, 6, 5, 1], [0, 8, 3, 6, 7, 4], [3, 8, 6, 2, 1]])
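
Both versions split on a single space, which works for this corpus because every punctuation mark sits at the end of a sentence. As a hedged illustration with a made-up sentence (not taken from the dataset), punctuation in the middle of a text leaves consecutive spaces behind, and `split(' ')` then returns empty strings that `vocab.index` would not find. Extracting the words directly with `re.findall(r'\w+', ...)` avoids the problem.

```python
import re

sample = 'Well, this is... a test!'  # hypothetical sentence, for illustration only

# Approach used above: replace non-word characters, then split on one space
with_split = re.sub(r'\W', ' ', sample).strip().lower().split(' ')

# Alternative: extract the runs of word characters directly
with_findall = re.findall(r'\w+', sample.lower())

print(with_split)    # contains '' entries where punctuation was followed by a space
print(with_findall)  # only the words
```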

# Downloading the IMDB_sample dataset (only 1000 examples)

The dataset is loaded from the datasets provided by the fast.ai course: https://course.fast.ai/datasets.html

The wget command fetches the imdb_sample.tgz file.
The tar command extracts the archive into the local directory.

In [None]:
!wget -nc http://files.fast.ai/data/examples/imdb_sample.tgz
!tar -xzf imdb_sample.tgz

--2020-03-10 22:33:26--  http://files.fast.ai/data/examples/imdb_sample.tgz
Resolving files.fast.ai (files.fast.ai)... 67.205.15.147
Connecting to files.fast.ai (files.fast.ai)|67.205.15.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 571827 (558K) [application/x-gtar-compressed]
Saving to: ‘imdb_sample.tgz’


2020-03-10 22:33:26 (1.74 MB/s) - ‘imdb_sample.tgz’ saved [571827/571827]
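
Where the `wget` and `tar` shell commands are not available, the same two steps can be approximated with the Python standard library. This is a sketch and not part of the original notebook: the helper name `fetch_and_extract` is hypothetical, and skipping the download when the archive already exists is only meant to imitate `wget -nc`.

```python
import tarfile
import urllib.request
from pathlib import Path

def fetch_and_extract(url, archive_path, dest_dir="."):
    """Download url to archive_path (unless it already exists) and extract it."""
    archive_path = Path(archive_path)
    if not archive_path.exists():  # imitates wget -nc (no-clobber)
        urllib.request.urlretrieve(url, str(archive_path))
    with tarfile.open(archive_path, "r:gz") as tar:  # imitates tar -xzf
        tar.extractall(dest_dir)
    return Path(dest_dir)
```

Calling `fetch_and_extract('http://files.fast.ai/data/examples/imdb_sample.tgz', 'imdb_sample.tgz')` would be expected to leave the same `imdb_sample` directory in the current folder as the shell commands do.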



The extracted directory contains a single file in CSV format:

In [None]:
!ls imdb_sample

texts.csv


In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('imdb_sample/texts.csv')
df.shape

(1000, 3)

In [None]:
df.head()

      label                                               text  is_valid
0  negative  Un-bleeping-believable! Meg Ryan doesn't even ...     False
1  positive  This is a extremely well-made film. The acting...     False
2  negative  Every once in a long while a movie will come a...     False
3  positive  Name just says it all. I watched this movie wi...     False
4  negative  This movie succeeds at being one of the most u...     False


## Testing the first function, which builds the vocabulary

In [None]:
list_w = ' '.join(corpus).lower()
list_regex = re.findall(r'\w+', list_w)
list_vocab = []
for word in list_regex:
  if word not in list_vocab:
    list_vocab.append(word)
list_vocab

['this', 'is', 'the', 'first', 'document', 'second', 'and', 'third', 'one']

## Creating the function that builds the vocabulary

In [None]:
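def build_vocab(corpus_test):
  """Return the unique lowercase words of the corpus, in order of first appearance."""
  text = ' '.join(corpus_test).lower()
  # dict.fromkeys drops duplicates while preserving insertion order,
  # avoiding a linear "word not in vocab" scan of a list for every word
  return list(dict.fromkeys(re.findall(r'\w+', text)))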

### Tests

In [None]:
build_vocab(['Isso e um teste, com virgulas, e pontos. ', 'Segundo teste. Sem virgulas. Com pontos '])

['isso', 'e', 'um', 'teste', 'com', 'virgulas', 'pontos', 'segundo', 'sem']

In [None]:
build_vocab(corpus)

['this', 'is', 'the', 'first', 'document', 'second', 'and', 'third', 'one']
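
Note that `build_vocab` returns the words in order of first appearance, whereas the CountVectorizer vocabulary shown earlier is in alphabetical order, so the same word gets a different index in each. A minimal sketch, assuming the only goal is to compare indices on the toy corpus: sorting the vocabulary reproduces the scikit-learn ordering. The vocabulary logic is repeated inline only to keep the example self-contained.

```python
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

words = re.findall(r'\w+', ' '.join(corpus).lower())
vocab_first_seen = list(dict.fromkeys(words))  # order of first appearance, as build_vocab returns
vocab_sorted = sorted(vocab_first_seen)        # alphabetical order, as in the sklearn vocabulary

# The word 'this' is index 0 in one ordering and index 8 in the other
print(vocab_first_seen.index('this'), vocab_sorted.index('this'))
```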

In [None]:
len(build_vocab(df['text'].values))

18705

## Building the tokenizer function

In [None]:
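def tokenizer(corpus, vocab):
  """Map each document to the list of vocabulary indices of its words."""
  # Build the word -> index lookup once; vocab.index(w) would rescan the whole list for every word
  word_to_idx = {w: i for i, w in enumerate(vocab)}
  corpus_token = []
  for sentence in corpus:
    list_word = re.findall(r'\w+', sentence.lower())
    corpus_token.append([word_to_idx[w] for w in list_word])
  return corpus_token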

## Test

In [None]:
voc = build_vocab(corpus)
tokenizer(corpus,voc)

[[0, 1, 2, 3, 4], [0, 4, 1, 2, 5, 4], [6, 0, 1, 2, 7, 8], [1, 0, 2, 3, 4]]

In [None]:
corpus_imdb = df['text'].values

In [None]:
corpus_imdb

array(["Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!",
       'This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two cou

In [None]:
voc = build_vocab(corpus_imdb)
list_tokenized = tokenizer(corpus_imdb,voc)

In [None]:
list_tokenized

In [None]:
len(list_tokenized)

1000

## Building the feature function


In [None]:
def feature(corpus_tok):
  max_word = 0
  for element in corpus_tok:
    for i in element:
      if i>max_word:
        max_word = i
  features = np.zeros((len(corpus_tok),max_word+1)) # Add 1 because token indices start at zero.
  print(features.shape)
  for index, element in enumerate(corpus_tok):
    for token in element:
      features[index,token] += 1
  return features

In [None]:
doc_term = feature(list_tokenized)

(1000, 18705)


In [None]:
doc_term.shape

(1000, 18705)

In [None]:
(doc_term[940,30])

78.0
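
The dense array above stores one float for every (document, word) pair, about 18.7 million cells for this sample, and presumably most of them are zero because a single review uses only a small part of the vocabulary. CountVectorizer returns a sparse matrix for this reason. As a rough pure-Python sketch of the same idea (the name `feature_sparse` is hypothetical), each document can keep only the tokens it actually contains:

```python
from collections import Counter

def feature_sparse(corpus_tok):
    # One Counter per document: token index -> number of occurrences
    return [Counter(doc) for doc in corpus_tok]

# Token lists of the four-sentence corpus, as returned by the tokenizer test above
toy_tokens = [[0, 1, 2, 3, 4], [0, 4, 1, 2, 5, 4], [6, 0, 1, 2, 7, 8], [1, 0, 2, 3, 4]]
rows = feature_sparse(toy_tokens)

print(rows[1][4])  # token 4 ('document') appears twice in the second sentence
print(rows[1][6])  # absent tokens read as 0 without being stored
```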

## Testing with scikit-learn's CountVectorizer

In [None]:
X = vectorizer.fit_transform(corpus_imdb)
vocab = vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
feature_sk = X.toarray()


### CountVectorizer

In [None]:
print("Vocabulary size produced by sklearn: {}".format(len(vocab)))
print("Occurrences of the most frequent word: '{}'".format(feature_sk.sum(axis=0).max()))
print("Most frequent word: {}".format(vocab[feature_sk.sum(axis=0).argmax()]))

Vocabulary size produced by sklearn: 18668
Occurrences of the most frequent word: '14507'
Most frequent word: the


## Using the functions built above

In [None]:
print("Vocabulary size produced by the custom functions: {}".format(len(voc)))
print("Occurrences of the most frequent word: '{}'".format(doc_term.sum(axis=0).max()))
print("Most frequent word: {}".format(voc[doc_term.sum(axis=0).argmax()]))

Vocabulary size produced by the custom functions: 18705
Occurrences of the most frequent word: '14507.0'
Most frequent word: the


In [None]:
for word in voc:
  if word not in vocab:
    print(word)


t
a
s
i
2
6
8
u
m
3
1
d
h
7
c
4
5
b
j
z
g
w
r
p
9
k
n
o
l
e
x
f
0
y
q
½
v


The vocabulary produced by sklearn (18668 words) is smaller than the one produced by the custom functions (18705 words). The other metrics agree: in both cases the most frequent word is "the", with 14507 occurrences.
Inspecting the words on which the two methods differ reveals the cause of the discrepancy. All 37 words listed above are single characters, which accounts for the gap of 37 between the two vocabulary sizes. The regex `\w+` splits an abbreviation such as "T.V" into two separate words, "t" and "v", and keeps both, whereas CountVectorizer's default `token_pattern` only accepts tokens with two or more word characters and therefore discards them.
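
The effect can be reproduced with the `re` module alone. The sketch below compares the `\w+` pattern used by the functions in this notebook with `\b\w\w+\b`, which is the word part of CountVectorizer's documented default `token_pattern`. The sentence is taken from review 652, shown further down.

```python
import re

sentence = "Nothing that appears on T.V. is an accident."

# Pattern used by build_vocab and tokenizer in this notebook
tokens_notebook = re.findall(r'\w+', sentence.lower())

# Pattern assumed to mirror CountVectorizer's default: two or more word characters
tokens_sklearn_like = re.findall(r'\b\w\w+\b', sentence.lower())

print(tokens_notebook)
print(tokens_sklearn_like)
# Words seen only by the \w+ pattern
print([w for w in tokens_notebook if w not in tokens_sklearn_like])
```

Filtering the custom vocabulary to words of length two or more would be one way to bring the two vocabularies in line.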

In [None]:
idx_v = voc.index('v')  # column of the token 'v' (12591 in this run)
doc_term[:, idx_v].argmax()  # review in which 'v' occurs most often

652

In [None]:
corpus_imdb[652]

"Sarah Silverman is a dangerous Bitch! She's beautiful, sexy, funny and talent, dark and demonic. I read the other 'comment' on this show as well as the message board stuff and people just don't get it. Nothing that appears on T.V. is an accident. Too much money, time and work is put into the production of a T.V. show for there to be mistakes. This show is stupid because Sarah wanted it to be stupid. This show is juvenile because Sarah wanted it to be juvenile. I thought the jokes were great and the theme show as well as the other musical numbers are wonderfully bizarre. It's a lot like Pee-Wee's Playhouse for maladjusted, slacker twenty-something glue sniffing, Future Pornstars of America from the Valley. The cast is awesome. The scenarios and action is well-paced. I hope this show succeeds since Comedy Central didn't let David spade keep his show. Who plays Sarah's sister? She not in the cast listing on the show's home page. I would love to see her stand-up. Does anyone know about he