## **NLP Tutorial: Text Representation - Bag Of Words (BOW)**

In [111]:
import pandas as pd
import numpy as np

In [112]:
df = pd.read_csv("/spam.csv")
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [113]:
df.Category.value_counts()

Unnamed: 0_level_0,count
Category,Unnamed: 1_level_1
ham,4825
spam,747


In [114]:
df['spam']=df['Category'].apply(lambda x:1 if x=='spam' else 0)

Alternatively, you can define a named function and pass it to `apply` to create the new `spam` column:

In [None]:
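def get_spam_number(x):
  # return 1 for spam messages and 0 for everything else (ham)
  if x == 'spam':
    return 1
  else:
    return 0

df['spam'] = df['Category'].apply(get_spam_number)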
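
As a quick check that these approaches agree, here is a minimal sketch on a small hypothetical DataFrame (`toy` and its rows are made up for illustration; only the `Category` column name comes from the data above). It compares the lambda, the named function, and a vectorized comparison that avoids `apply` altogether.

```python
import pandas as pd

# Hypothetical stand-in for the real spam.csv data
toy = pd.DataFrame({"Category": ["ham", "spam", "ham", "spam"]})

def get_spam_number(x):
    """Return 1 for 'spam' and 0 for anything else."""
    return 1 if x == "spam" else 0

via_lambda = toy["Category"].apply(lambda x: 1 if x == "spam" else 0)
via_function = toy["Category"].apply(get_spam_number)
# Vectorized alternative: compare the whole column at once, then cast bool -> int
via_comparison = (toy["Category"] == "spam").astype(int)

print(via_lambda.tolist())
print(via_function.tolist())
print(via_comparison.tolist())
```

All three produce the same column; the vectorized comparison is usually the fastest on large frames.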

In [115]:
df.shape

(5572, 3)

In [116]:
df.head(10)

Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0
5,spam,FreeMsg Hey there darling it's been 3 week's n...,1
6,ham,Even my brother is not like to speak with me. ...,0
7,ham,As per your request 'Melle Melle (Oru Minnamin...,0
8,spam,WINNER!! As a valued network customer you have...,1
9,spam,Had your mobile 11 months or more? U R entitle...,1


### **Train test split**

In [117]:
from sklearn.model_selection import train_test_split

In [118]:
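# create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam, test_size=0.2)

# X holds the independent variable (the message text), y the dependent variable (the spam label)
# test_size=0.2 keeps 20% of the rows for testing and the remaining 80% for training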

By convention, the independent variables (also called predictors or regressors) are written with a capital `X` because there can be one or many variables/columns. The dependent variable (the response or measured variable) is written with a lowercase `y` because we predict a single variable from `X`.
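
The split above is random, so re-running the notebook gives different rows in each set. A minimal sketch of a reproducible, class-balanced split follows; the toy `messages` and `labels` lists are hypothetical, and `random_state=42` is an arbitrary seed chosen for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 messages, half of them spam
messages = [f"message {i}" for i in range(10)]
labels = [0, 1] * 5

X_train, X_test, y_train, y_test = train_test_split(
    messages,
    labels,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=42,    # fixed seed so the same split is produced on every run
    stratify=labels,    # keep the ham/spam ratio the same in both sets
)

print(len(X_train), len(X_test))
print(sorted(y_test))
```

With an imbalanced dataset such as this one (4825 ham versus 747 spam), `stratify` may be worth considering so that the test set does not end up with too few spam messages by chance.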

In [119]:
X_train.shape

(4457,)

In [120]:
X_test.shape

(1115,)
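
The two shapes follow directly from `test_size=0.2` applied to the 5572 rows. The arithmetic can be sketched as below; the rounding rule (test count rounded up, remainder used for training) is my understanding of scikit-learn's behaviour rather than something stated in this notebook.

```python
import math

n_samples = 5572
test_size = 0.2

# Assumption: the test count is rounded up and the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```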

In [121]:
type(X_train)

In [122]:
X_train[:4]

Unnamed: 0,Message
2766,and picking them up from various points
29,Ahhh. Work. I vaguely remember that! What does...
3568,Collect your VALENTINE'S weekend to PARIS inc ...
1616,Mm i had my food da from out


In [128]:
X_train.loc[2041]
# look up a message by its index label from the original DataFrame

'You always make things bigger than they are'

In [129]:
type(y_train)

In [130]:
y_train[:4]

Unnamed: 0,spam
2766,0
29,0
3568,1
1616,0


In [131]:
type(X_train.values)

numpy.ndarray

In [132]:
(X_train.values)

array(['and  picking them up from various points',
       'Ahhh. Work. I vaguely remember that! What does it feel like? Lol',
       "Collect your VALENTINE'S weekend to PARIS inc Flight & Hotel + £200 Prize guaranteed! Text: PARIS to No: 69101. www.rtf.sphosting.com",
       ..., 'Great! How is the office today?',
       "Don't worry, * is easy once have ingredients!",
       "(You didn't hear it from me)"], dtype=object)

## **Create bag of words representation using CountVectorizer**

For more details, see the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
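
Before using the library, it may help to see the idea in plain Python. The sketch below is not how `CountVectorizer` is implemented; it is a simplified illustration on two hypothetical sentences, with a `tokenize` helper (my own name) that lowercases the text and keeps tokens of two or more word characters, which is intended to resemble the vectorizer's defaults.

```python
import re
from collections import Counter

docs = ["Free prize, free entry!", "Are you free tomorrow?"]  # hypothetical

def tokenize(text):
    # Lowercase, then keep runs of 2 or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is every distinct token, sorted alphabetically
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

def to_counts(text):
    # One count per vocabulary word, in vocabulary order
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

matrix = [to_counts(doc) for doc in docs]

print(vocabulary)
print(matrix)
```

Each row is one document and each column is one vocabulary word; word order within the sentence is discarded, which is why the representation is called a "bag" of words.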

In [133]:
from sklearn.feature_extraction.text import CountVectorizer

In [134]:
v = CountVectorizer()

X_train_cv = v.fit_transform(X_train.values)
X_train_cv
# fit_transform learns the vocabulary and returns a large sparse matrix: one row per message, one column per word

<4457x7714 sparse matrix of type '<class 'numpy.int64'>'
	with 59031 stored elements in Compressed Sparse Row format>
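
To see why a sparse format is used, the numbers in this output can be plugged into a short calculation. This is only a worked-arithmetic sketch; the 8 bytes per cell assumes the `int64` dtype reported above.

```python
rows, cols = 4457, 7714      # shape reported for X_train_cv
stored = 59031               # stored (non-zero) elements reported above

total_cells = rows * cols
density = stored / total_cells

# A dense int64 array needs 8 bytes per cell
dense_bytes = total_cells * 8

print(total_cells)
print(f"{density:.4%} of the cells are non-zero")
print(f"{dense_bytes / 1024**2:.0f} MiB as a dense array")
```

Fewer than 0.2% of the cells hold a count, so storing only the non-zero entries saves a great deal of memory compared with calling `toarray()`.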

In [135]:
X_train_cv.toarray()
# convert the sparse matrix to a dense array
# the result is a 2-dimensional NumPy array

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [136]:
X_train_cv.toarray()[:2]
# view the first 2 rows (samples)

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [137]:
X_train_cv.toarray()[:2][0]
#view the first sample only

array([0, 0, 0, ..., 0, 0, 0])

In [138]:
X_train_cv.toarray()[:1]

array([[0, 0, 0, ..., 0, 0, 0]])

In [139]:
X_train_cv.shape
# the second dimension is the vocabulary size: 7714 unique words in this run

(4457, 7714)

In [140]:
v.get_feature_names_out().shape

(7714,)

In [141]:
v.get_feature_names_out()
# this returns the entire learned vocabulary

array(['00', '000', '000pes', ..., 'zyada', 'ú1', '〨ud'], dtype=object)

In [142]:
v.get_feature_names_out()[10:30]

array(['02072069400', '02073162414', '02085076972', '021', '03', '04',
       '0430', '05', '050703', '0578', '06', '07', '07008009200',
       '07090201529', '07090298926', '07099833605', '07123456789',
       '0721072', '07732584351', '07734396839'], dtype=object)

In [143]:
v.get_feature_names_out()[1000:1050]

array(['apeshit', 'aphex', 'apo', 'apologetic', 'apologise', 'apologize',
       'apology', 'app', 'apparently', 'appeal', 'appear', 'appendix',
       'applausestore', 'applebees', 'application', 'apply', 'applyed',
       'applying', 'appointment', 'appointments', 'appreciate',
       'appreciated', 'approaches', 'approaching', 'appropriate',
       'approved', 'approx', 'apps', 'appt', 'appy', 'april', 'aproach',
       'apt', 'aptitude', 'aquarius', 'ar', 'arab', 'arabian', 'arcade',
       'ard', 'are', 'area', 'aren', 'arent', 'arestaurant', 'aretaking',
       'areyouunique', 'argentina', 'argh', 'argue'], dtype=object)

In [144]:
dir(v)
# lists every attribute and method available on this CountVectorizer instance

['__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__sklearn_clone__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_build_request_for_signature',
 '_char_ngrams',
 '_char_wb_ngrams',
 '_check_feature_names',
 '_check_n_features',
 '_check_stop_words_consistency',
 '_check_vocabulary',
 '_count_vocab',
 '_doc_link_module',
 '_doc_link_template',
 '_doc_link_url_param_generator',
 '_get_default_requests',
 '_get_doc_link',
 '_get_metadata_request',
 '_get_param_names',
 '_get_tags',
 '_limit_features',
 '_more_tags',
 '_parameter_constraints',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_sort_features',
 '_stop_words_id',
 '_validate_data',
 '_

In [150]:
v.vocabulary_

{'and': 951,
 'picking': 5183,
 'them': 6796,
 'up': 7149,
 'from': 3025,
 'various': 7215,
 'points': 5263,
 'ahhh': 860,
 'work': 7562,
 'vaguely': 7204,
 'remember': 5662,
 'that': 6783,
 'what': 7444,
 'does': 2374,
 'it': 3755,
 'feel': 2805,
 'like': 4109,
 'lol': 4182,
 'collect': 1884,
 'your': 7681,
 'valentine': 7206,
 'weekend': 7409,
 'to': 6903,
 'paris': 5062,
 'inc': 3644,
 'flight': 2899,
 'hotel': 3509,
 '200': 346,
 'prize': 5399,
 'guaranteed': 3273,
 'text': 6760,
 'no': 4778,
 '69101': 587,
 'www': 7612,
 'rtf': 5813,
 'sphosting': 6354,
 'com': 1894,
 'mm': 4517,
 'had': 3304,
 'my': 4650,
 'food': 2936,
 'da': 2132,
 'out': 4994,
 'oh': 4897,
 'baby': 1183,
 'of': 4878,
 'the': 6787,
 'house': 3516,
 'how': 3520,
 'come': 1900,
 'you': 7677,
 'dont': 2398,
 'have': 3367,
 'any': 984,
 'new': 4740,
 'pictures': 5188,
 'on': 4920,
 'facebook': 2741,
 'ever': 2667,
 'thought': 6834,
 'about': 748,
 'living': 4155,
 'good': 3196,
 'life': 4097,
 'with': 7516,
 'perfe

In [155]:
v.get_feature_names_out()[7700]

'yup'

In [156]:
X_train_np_array = X_train_cv.toarray()
X_train_np_array
# store the dense array in a variable
# so that it is easy to inspect

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Get an idea of how indexing and slicing work on this array:

In [157]:
X_train_np_array[0]

array([0, 0, 0, ..., 0, 0, 0])

In [158]:
X_train_np_array[:0]

array([], shape=(0, 7714), dtype=int64)

In [159]:
X_train_np_array[:1]

array([[0, 0, 0, ..., 0, 0, 0]])

In [160]:
X_train_np_array[1]

array([0, 0, 0, ..., 0, 0, 0])

In [161]:
X_train_np_array[:6]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [162]:
X_train_np_array[:6][0]

array([0, 0, 0, ..., 0, 0, 0])

In [163]:
X_train_np_array[6]

array([0, 0, 0, ..., 0, 0, 0])
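
The difference between indexing (`[0]`) and slicing (`[:1]`, `[:0]`) is easier to see from the shapes. A minimal sketch on a small hypothetical array:

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)   # hypothetical: 3 samples x 4 features

print(arr[0].shape)      # indexing returns a single row as a 1-D vector
print(arr[:1].shape)     # slicing keeps 2 dimensions: one row
print(arr[:0].shape)     # empty slice: zero rows, same number of columns
print(arr[:2][0].shape)  # slice first, then index: the first row again
```

This is why `X_train_np_array[:0]` above printed an empty array with shape `(0, 7714)`, while `X_train_np_array[0]` printed the first sample.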

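The remaining cells in this section come from a run with a different random split, so the sample indices and the vocabulary size (7798 words) differ from the outputs above.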

In [77]:
X_train[:4]
# these are the first 4 samples; the 1st sample has index label 2041

Unnamed: 0,Message
2041,You always make things bigger than they are
1153,Ok i go change also...
1345,Were somewhere on Fredericksburg
5288,An excellent thought by a misundrstud frnd: I ...


In [78]:
X_train[:4][2041]
# the 1st sample has index label 2041; view its message text

'You always make things bigger than they are'

In [74]:
np.where(X_train_np_array[0]!=0)
# find the positions where the count vector of the 1st sample is non-zero
# each position is a word id (the column index assigned to that word in the vocabulary) for a word that occurs in the 1st message

(array([ 954, 1073, 1385, 4392, 6844, 6883, 6889, 7761]),)

You always make things bigger than they are  ->  [ 954, 1073, 1385, 4392, 6844, 6883, 6889, 7761]

**Keep in mind that these ids follow the order of the vocabulary, not the order of the words in the sentence.**

In [80]:
np.where(X_train_np_array[0]==0)
# these are the ids of the vocabulary words that do not occur in the 1st sample

(array([   0,    1,    2, ..., 7795, 7796, 7797]),)

#### **let's check it**

In [103]:
X_train_np_array[0][954],X_train_np_array[0][1073],X_train_np_array[0][1385],X_train_np_array[0][4392],X_train_np_array[0][6844],X_train_np_array[0][6883],X_train_np_array[0][6889],X_train_np_array[0][7761]

(1, 1, 1, 1, 1, 1, 1, 1)

In [98]:
v.get_feature_names_out()[954], v.get_feature_names_out()[1073],v.get_feature_names_out()[1385],v.get_feature_names_out()[4392],v.get_feature_names_out()[6844],v.get_feature_names_out()[6883],v.get_feature_names_out()[6889],v.get_feature_names_out()[7761]

('always', 'are', 'bigger', 'make', 'than', 'they', 'things', 'you')

In [108]:
x = [X_train_np_array[0][954],
     X_train_np_array[0][1073],
     X_train_np_array[0][1385],
     X_train_np_array[0][4392],
     X_train_np_array[0][6844],
     X_train_np_array[0][6883],
     X_train_np_array[0][6889],
     X_train_np_array[0][7761]
     ]

x

[1, 1, 1, 1, 1, 1, 1, 1]

**Verified: every word in the sentence has a count of 1.**

## **Train the Naive Bayes model**

In [175]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_cv, y_train)

In [176]:
X_test_cv = v.transform(X_test)

### **Evaluate Performance**

In [177]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test_cv)

print(classification_report(y_test, y_pred))

ValueError: X has 3645 features, but MultinomialNB is expecting 7714 features as input.

**Hypothesis**

The error arises because the model was trained on a matrix with 7714 features (`X_train_cv`), while the matrix passed to `predict` (`X_test_cv`) has only 3645.

`CountVectorizer.transform` always produces one column per word in the vocabulary learned during `fit`, so a mismatch like this means `X_test_cv` was built from a different vocabulary than `X_train_cv`. In a notebook, two likely causes are:

**1) The vectorizer was refit on different data:** `v` was fit again (for example with `fit_transform` on the test set) after `X_train_cv` was created, so it now holds a different vocabulary.

**2) Cells were run out of order:** `X_test_cv` or `model` is left over from an earlier run and no longer matches the current state of `v`.

**Suggested Changes**

To fix this error, you need to ensure that the X_test_cv data has the same number of features as the data the model was trained on (X_train_cv). This can be achieved by using the same vectorizer instance (with the same vocabulary) that was used during training to transform both the training and testing data.

Here's a potential solution assuming that v is the vectorizer instance:

In [178]:
# Assuming 'v' is the vectorizer instance used during training

# Fit the vectorizer on the training data (if not already done)
# v.fit(X_train)  # Only if 'v' was not already fit during training

# Transform both training and testing data using the same vectorizer instance
X_train_cv = v.transform(X_train)
X_test_cv = v.transform(X_test)

# Train the model
model = MultinomialNB()
model.fit(X_train_cv, y_train)

# Make predictions
y_pred = model.predict(X_test_cv)

# Print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.97      0.98       958
           1       0.85      0.97      0.90       157

    accuracy                           0.97      1115
   macro avg       0.92      0.97      0.94      1115
weighted avg       0.97      0.97      0.97      1115



Explanation of changes:

**1) Consistent Vectorization:** Both the training and test data are transformed with the same vectorizer instance (`v`), so both matrices have the same columns and the model can predict without raising the ValueError.

**2) Fitting the Vectorizer (if needed):** If `v` was not fit on the training data, or may have been refit on other data since, call `v.fit(X_train)` before transforming. This ensures the vocabulary comes from the training data only. The pipeline at the end of this tutorial, which always fits its vectorizer on `X_train`, scores higher on the same split (spam precision 0.97 versus 0.85 here), which suggests that `v` may not have been fit on `X_train` when this cell ran.

With this change, the ValueError is resolved and the model can predict on the test data.
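
The mismatch is easy to reproduce on a small scale. The sketch below uses hypothetical toy documents and contrasts reusing the fitted vectorizer with fitting a fresh one on the test data; it is an illustration of the failure mode, not a reconstruction of what happened in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free prize now", "see you tomorrow", "win a free prize"]  # hypothetical
test_docs = ["free lunch tomorrow"]

v = CountVectorizer()
X_train_cv = v.fit_transform(train_docs)   # learns the vocabulary from training data only

# Correct: reuse the fitted vectorizer, so the columns match the training matrix
X_test_ok = v.transform(test_docs)

# Incorrect: a vectorizer fit on the test data learns a different, smaller vocabulary
X_test_bad = CountVectorizer().fit_transform(test_docs)

print(X_train_cv.shape, X_test_ok.shape, X_test_bad.shape)
```

Words that appear only in the test data (here "lunch") are simply ignored by `transform`, which keeps the column count fixed.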


In [179]:
# test the model on new, unseen messages
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]

emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1])

## **Train the model using sklearn pipeline and reduce number of lines of code**

In [180]:
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [181]:
clf.fit(X_train, y_train)

In [182]:
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       958
           1       0.97      0.95      0.96       157

    accuracy                           0.99      1115
   macro avg       0.98      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115
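
To make clear what the pipeline does behind the scenes, the sketch below trains the same two steps manually and through a `Pipeline` on a tiny hypothetical dataset, then checks that both routes give the same predictions. The documents and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = spam, 0 = ham
train_docs = [
    "free prize win now",
    "win cash prize free",
    "see you at lunch",
    "are you coming home",
]
train_labels = [1, 1, 0, 0]
new_docs = ["free cash prize", "see you at home"]

# Manual route: fit the vectorizer, train the model, then transform and predict
v = CountVectorizer()
model = MultinomialNB()
model.fit(v.fit_transform(train_docs), train_labels)
manual_pred = model.predict(v.transform(new_docs))

# Pipeline route: the same steps chained behind a single fit/predict interface
clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
clf.fit(train_docs, train_labels)
pipeline_pred = clf.predict(new_docs)

print(manual_pred, pipeline_pred)
```

Because the pipeline always fits its vectorizer on the data passed to `fit` and only calls `transform` at prediction time, it rules out the feature-count mismatch seen earlier.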

