
CountVectorizer returning an array with only zeros #3

Open

galbafoz-d01 opened this issue Dec 31, 2021 · 0 comments
Hi, everyone!

I am new to NLP, and this is the first time I have used an sklearn vectorizer; I am following a tutorial for sentiment analysis with a different corpus. For some reason the resulting arrays are almost all zeros (a 1 here and there, but very few of them).

The following code is what I used to preprocess the corpus.

import re
from collections import Counter

from nltk import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# Assumption: normalizer is NLTK's WordNetLemmatizer (its definition
# was not included in my original snippet).
normalizer = WordNetLemmatizer()

def get_part_of_speech(word):
  # Count how many WordNet synsets of the word belong to each POS.
  probable_part_of_speech = wordnet.synsets(word)
  pos_counts = Counter()
  pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos() == "n"])
  pos_counts["v"] = len([item for item in probable_part_of_speech if item.pos() == "v"])
  pos_counts["a"] = len([item for item in probable_part_of_speech if item.pos() == "a"])
  pos_counts["r"] = len([item for item in probable_part_of_speech if item.pos() == "r"])
  # Pick the POS with the most synsets ("n" wins ties, and is also the
  # default when the word is not in WordNet at all).
  most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
  return most_likely_part_of_speech

def preprocess_text(text):
  # Replace runs of non-word characters with a space and lowercase.
  cleaned = re.sub(r'\W+', ' ', text).lower()
  tokenized = word_tokenize(cleaned)
  # Drop English stopwords and lemmatize each remaining token with
  # its most likely part of speech.
  lemmatized = [normalizer.lemmatize(token, get_part_of_speech(token))
                for token in tokenized if token not in stopwords.words('english')]
  normalized = ' '.join(lemmatized)
  return normalized
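For reference, here is a quick sanity check of the preprocessing on a made-up sentence (the sentence is hypothetical, not from my corpus, and the exact output depends on the installed NLTK data):

# Hypothetical smoke test for the preprocessing step.
sample = "The movies were surprisingly entertaining!"
print(preprocess_text(sample))
# Stopwords ("the", "were") should be dropped and the remaining
# tokens lemmatized, e.g. "movies" -> "movie".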

And here is the code with the vectorizer (passing the full vocabulary explicitly via vocab is one of the solutions I found in other threads, but it is still not working).

import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

pos = open('NLTK/short_reviews/positive.txt', 'r', encoding='latin-1').read()
neg = open('NLTK/short_reviews/negative.txt', 'r', encoding='latin-1').read()

# One review per line; skip empty lines and preprocess each review.
pos_clean = [preprocess_text(sen) for sen in pos.split('\n') if sen != '']
neg_clean = [preprocess_text(sen) for sen in neg.split('\n') if sen != '']

x_clean = pos_clean + neg_clean
labels = [1] * len(pos_clean) + [0] * len(neg_clean)

# Collect every distinct token in the preprocessed corpus.
vocab = []
for sentence in x_clean:
  for word in sentence.split(' '):
    if word not in vocab:
      vocab.append(word)

x_train, x_test, y_train, y_test = train_test_split(
  x_clean, labels, test_size=0.2, random_state=42
)

# Passing the full vocabulary explicitly, as suggested in other threads.
vectorizer = CountVectorizer(vocabulary=vocab)

x_vec = vectorizer.fit_transform(x_train).toarray()

# xt_vec = vectorizer.transform(x_test).toarray()

# Print one full row of the document-term matrix.
with numpy.printoptions(threshold=numpy.inf):
  print(x_vec[0])
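In case it helps, here is a minimal diagnostic sketch I can run with the x_vec and vocab variables from above, just to quantify how sparse a row actually is:

# Diagnostic sketch: compare the number of nonzero entries in one
# row against the vocabulary size. A bag-of-words row can only have
# as many nonzero columns as the sentence has distinct tokens, so
# with a large vocabulary most entries being zero is expected.
print(len(vocab))                     # total number of columns
print(numpy.count_nonzero(x_vec[0]))  # nonzero counts in the first row
print(x_train[0])                     # the preprocessed sentence itself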

Thanks a lot in advance, and please do not hesitate to ask if any information is missing!

I hope I can figure out what is going on...
