
CountVectorizer returning an array with only zeros #3

Open

galbafoz-d01 opened this issue Dec 31, 2021 · 0 comments
Hi, everyone!

I am new to NLP, and this is the first time I have used an sklearn vectorizer; I am following a tutorial for sentiment analysis with a different corpus. For some reason the resulting arrays are almost all zeros (a 1 here and there, but very few of them).

The following code is what I used to preprocess the corpus.

import re
from collections import Counter

from nltk import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# Assumption: normalizer is NLTK's WordNetLemmatizer (its definition
# was not included in my original snippet).
normalizer = WordNetLemmatizer()

def get_part_of_speech(word):
  # Count how many WordNet synsets of the word belong to each POS.
  probable_part_of_speech = wordnet.synsets(word)
  pos_counts = Counter()
  pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos() == "n"])
  pos_counts["v"] = len([item for item in probable_part_of_speech if item.pos() == "v"])
  pos_counts["a"] = len([item for item in probable_part_of_speech if item.pos() == "a"])
  pos_counts["r"] = len([item for item in probable_part_of_speech if item.pos() == "r"])
  # Pick the POS with the most synsets ("n" wins ties, and is also the
  # default when the word is not in WordNet at all).
  most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
  return most_likely_part_of_speech

def preprocess_text(text):
  # Replace runs of non-word characters with a space and lowercase.
  cleaned = re.sub(r'\W+', ' ', text).lower()
  tokenized = word_tokenize(cleaned)
  # Drop English stopwords and lemmatize each remaining token with
  # its most likely part of speech.
  lemmatized = [normalizer.lemmatize(token, get_part_of_speech(token))
                for token in tokenized if token not in stopwords.words('english')]
  normalized = ' '.join(lemmatized)
  return normalized
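For reference, here is a quick sanity check of the preprocessing on a made-up sentence (the sentence is hypothetical, not from my corpus, and the exact output depends on the installed NLTK data):

# Hypothetical smoke test for the preprocessing step.
sample = "The movies were surprisingly entertaining!"
print(preprocess_text(sample))
# Stopwords ("the", "were") should be dropped and the remaining
# tokens lemmatized, e.g. "movies" -> "movie".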

And here is the code with the vectorizer (passing the full vocabulary explicitly via vocab is one of the solutions I found in other threads, but it is still not working).

import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

pos = open('NLTK/short_reviews/positive.txt', 'r', encoding='latin-1').read()
neg = open('NLTK/short_reviews/negative.txt', 'r', encoding='latin-1').read()

# One review per line; skip empty lines and preprocess each review.
pos_clean = [preprocess_text(sen) for sen in pos.split('\n') if sen != '']
neg_clean = [preprocess_text(sen) for sen in neg.split('\n') if sen != '']

x_clean = pos_clean + neg_clean
labels = [1] * len(pos_clean) + [0] * len(neg_clean)

# Collect every distinct token in the preprocessed corpus.
vocab = []
for sentence in x_clean:
  for word in sentence.split(' '):
    if word not in vocab:
      vocab.append(word)

x_train, x_test, y_train, y_test = train_test_split(
  x_clean, labels, test_size=0.2, random_state=42
)

# Passing the full vocabulary explicitly, as suggested in other threads.
vectorizer = CountVectorizer(vocabulary=vocab)

x_vec = vectorizer.fit_transform(x_train).toarray()

# xt_vec = vectorizer.transform(x_test).toarray()

# Print one full row of the document-term matrix.
with numpy.printoptions(threshold=numpy.inf):
  print(x_vec[0])
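In case it helps, here is a minimal diagnostic sketch I can run with the x_vec and vocab variables from above, just to quantify how sparse a row actually is:

# Diagnostic sketch: compare the number of nonzero entries in one
# row against the vocabulary size. A bag-of-words row can only have
# as many nonzero columns as the sentence has distinct tokens, so
# with a large vocabulary most entries being zero is expected.
print(len(vocab))                     # total number of columns
print(numpy.count_nonzero(x_vec[0]))  # nonzero counts in the first row
print(x_train[0])                     # the preprocessed sentence itself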

Thanks a lot in advance, and please do not hesitate to ask if any information is missing!

I hope I can figure out what is going on...
