ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 11 while Y.shape[1] == 1 #11367

Closed
Pechi77 opened this issue Jun 27, 2018 · 6 comments

Comments

@Pechi77

Pechi77 commented Jun 27, 2018

Description

Hi, I am new to vectorizing and feature extraction. I am trying to compute the similarity of a single document against multiple documents.

Steps/Code to Reproduce

Here is my code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer


train_set = ["president of India","machine learning is awesome", "python is awesome", "thanks for reading"]

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set)
tfidf_matrix_test = tfidf_vectorizer.fit_transform(["president"])
cosine_similarity(tfidf_matrix_train,tfidf_matrix_test)

Expected Results

similarity list=[matching values with each document]

Actual Results

ValueError                                Traceback (most recent call last)
<ipython-input-19-e0da281ca84b> in <module>()
     15 tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set)
     16 tfidf_matrix_test = tfidf_vectorizer.fit_transform(["president"])
---> 17 cosine_similarity(tfidf_matrix_train,tfidf_matrix_test)
     18 #finds the tfidf score with normalization

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in cosine_similarity(X, Y, dense_output)
    908     # to avoid recursive import
    909 
--> 910     X, Y = check_pairwise_arrays(X, Y)
    911 
    912     X_normalized = normalize(X, copy=True)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype)
    120         raise ValueError("Incompatible dimension for X and Y matrices: "
    121                          "X.shape[1] == %d while Y.shape[1] == %d" % (
--> 122                              X.shape[1], Y.shape[1]))
    123 
    124     return X, Y

ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 11 while Y.shape[1] == 1
@jnothman
Member

Don't use fit_transform a second time if you want the feature representation to be the same for test data. Use transform.

But this is not a forum for usage questions, it is a tracker for software development issues.
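
To illustrate the point about fit_transform versus transform, here is a minimal sketch using the same training sentences as the report. The variable names (vectorizer, refit_matrix, test_matrix) are illustrative and not from the original code. A fresh TfidfVectorizer stands in for the second fit_transform call, since re-fitting discards the vocabulary learned from the training set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_set = ["president of India", "machine learning is awesome",
             "python is awesome", "thanks for reading"]

# Fit once on the training documents: this learns an 11-term vocabulary.
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_set)
print(train_matrix.shape)   # (4, 11)

# Fitting again on the query alone learns a new, 1-term vocabulary,
# so the column count no longer matches the training matrix.
refit_matrix = TfidfVectorizer().fit_transform(["president"])
print(refit_matrix.shape)   # (1, 1)

# transform reuses the fitted vocabulary, so the columns line up.
test_matrix = vectorizer.transform(["president"])
print(test_matrix.shape)    # (1, 11)
```

The mismatch between 11 and 1 columns is what check_pairwise_arrays reports in the traceback above.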

@Pechi77
Author

Pechi77 commented Jun 27, 2018 via email

@jnothman
Member

jnothman commented Jun 27, 2018 via email

@Pechi77
Author

Pechi77 commented Jun 27, 2018

Could you please give an example or a reference link?

@jnothman
Member

jnothman commented Jun 27, 2018 via email

@rakeshskc

rakeshskc commented Oct 22, 2018

I have made a small change to your code: the test document now goes through transform instead of fit_transform.

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

train_set = ["president of India", "machine learning is awesome", "python is awesome", "thanks for reading"]

# Fit the vocabulary and IDF weights on the training documents only
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set)
# Reuse the fitted vocabulary for the query: transform, not fit_transform
tfidf_matrix_test = tfidf_vectorizer.transform(["president"])

# One row per training document, one column for the query
print(cosine_similarity(tfidf_matrix_train, tfidf_matrix_test))
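
If a flat similarity list is wanted, as described under Expected Results, one possible sketch is to put the query first and flatten the result. The names query_vec, similarities and best_index are illustrative, and picking the best match with argmax is an added convenience, not part of the original question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_set = ["president of India", "machine learning is awesome",
             "python is awesome", "thanks for reading"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_set)
query_vec = vectorizer.transform(["president"])

# Shape (1, 4): one score per training document; ravel gives a flat array.
similarities = cosine_similarity(query_vec, train_matrix).ravel()

# Index of the most similar training document.
best_index = int(np.argmax(similarities))

for doc, score in zip(train_set, similarities):
    print(f"{score:.3f}  {doc}")
print("best match:", train_set[best_index])
```

Only the first document shares a term with the query, so it is the only one with a non-zero score.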
