Most classic machine learning algorithms can't take in raw text. Instead, we need to extract numerical features from the raw text and pass those to the algorithm. For example, we could count the occurrences of each word to map each text to a vector of numbers.

**Count Vectorization**

In [1]:
messages = ["Hey, How are you today?","Boss is calling you.","AXA is the biggest insurance company in the world."]

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
vect= CountVectorizer()

In [6]:
X = vect.fit_transform(messages)
# On scikit-learn 1.0+, use vect.get_feature_names_out(); get_feature_names() was removed in 1.2
vect.get_feature_names()

['are',
 'axa',
 'biggest',
 'boss',
 'calling',
 'company',
 'hey',
 'how',
 'in',
 'insurance',
 'is',
 'the',
 'today',
 'world',
 'you']

The output lists every unique word found in the three messages. Each unique word becomes one feature, that is, one column of the count matrix `X`.
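To look at the counts themselves rather than just the vocabulary, the result can be converted to a dense array. This is a minimal sketch that reuses the three messages from above; the names `vocab` and `counts` are introduced here purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

messages = ["Hey, How are you today?",
            "Boss is calling you.",
            "AXA is the biggest insurance company in the world."]

vect = CountVectorizer()
X = vect.fit_transform(messages)

# vocabulary_ maps each word to its column index in the count matrix
vocab = vect.vocabulary_

# One row per message, one column per unique word
counts = X.toarray()
print(counts.shape)               # (3, 15)

# "the" occurs twice in the third message
print(counts[2, vocab["the"]])    # 2
```

Converting to a dense array is fine for three messages, but for a real corpus it is usually better to keep the sparse result as it is.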

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [9]:
vectorizer = TfidfVectorizer()
doc_vec = vectorizer.fit_transform(messages)

In [12]:
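# Create a DataFrame with one row per word and one column per message
# (on scikit-learn 1.0+, use vectorizer.get_feature_names_out() instead)
import pandas as pd
df1 = pd.DataFrame(doc_vec.toarray().transpose(), index=vectorizer.get_feature_names())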
print(df1)

                  0         1         2
are        0.467351  0.000000  0.000000
axa        0.000000  0.000000  0.307461
biggest    0.000000  0.000000  0.307461
boss       0.000000  0.562829  0.000000
calling    0.000000  0.562829  0.000000
company    0.000000  0.000000  0.307461
hey        0.467351  0.000000  0.000000
how        0.467351  0.000000  0.000000
in         0.000000  0.000000  0.307461
insurance  0.000000  0.000000  0.307461
is         0.000000  0.428046  0.233832
the        0.000000  0.000000  0.614922
today      0.467351  0.000000  0.000000
world      0.000000  0.000000  0.307461
you        0.355432  0.428046  0.000000


A matrix like the one above, in which most entries are zero, is called a ***sparse matrix***.
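One way to check how sparse the result really is: the vectorizer returns a SciPy sparse matrix, which stores only the non-zero entries. The sketch below reuses the three messages from above; `density` is a name chosen here for illustration.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

messages = ["Hey, How are you today?",
            "Boss is calling you.",
            "AXA is the biggest insurance company in the world."]

doc_vec = TfidfVectorizer().fit_transform(messages)

# The result is a SciPy sparse matrix, not a dense NumPy array
print(sparse.issparse(doc_vec))

# nnz is the number of explicitly stored (non-zero) entries
n_rows, n_cols = doc_vec.shape
density = doc_vec.nnz / (n_rows * n_cols)
print(doc_vec.nnz, "non-zero entries out of", n_rows * n_cols)
print("density:", round(density, 3))
```

With a vocabulary of thousands of words and short messages, the density drops much further, which is why the sparse format matters.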

Read more about [TF-IDF (Term Frequency - Inverse Document Frequency)](https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/) **Very Useful**

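TF-IDF weighs how often a word appears in a single document against how common that word is across the entire corpus. Words that occur in many documents (such as "is" or "you" above) receive lower weights than words that are distinctive to one document.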
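The weights in the table above can be reproduced by hand. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency is computed as `ln((1 + n) / (1 + df)) + 1` and each document row is then scaled to unit length. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

messages = ["Hey, How are you today?",
            "Boss is calling you.",
            "AXA is the biggest insurance company in the world."]

# Raw term counts: one row per message, one column per word
counts = CountVectorizer().fit_transform(messages).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many messages does each word occur?
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF, as used by scikit-learn's defaults
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then L2-normalize each row
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library result
reference = TfidfVectorizer().fit_transform(messages).toarray()
print(np.allclose(manual, reference))
```

Note how "you" ends up with a lower weight than "hey" in the first message: both occur once there, but "you" also appears in the second message.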


## Feature Extraction from Text

Now we'll use the text of each message to classify it based on its content. We'll use scikit-learn's [feature extraction tools](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).

**Load a dataset**

In [13]:
import pandas as pd
import numpy as np

df= pd.read_csv('smsspamcollection.tsv',sep='\t')
df.head()

   label                                            message  length  punct
0    ham  Go until jurong point, crazy.. Available only ...     111      9
1    ham                      Ok lar... Joking wif u oni...      29      6
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...     155      6
3    ham  U dun say so early hor... U c already then say...      49      6
4    ham  Nah I don't think he goes to usf, he lives aro...      61      2


In [20]:
df.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

In [21]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [22]:
from sklearn.model_selection import train_test_split

In [23]:
X = df['message']
y = df['label']

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42)

In [25]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

In [27]:
# Fit the vectorizer to the data (build the vocabulary, count the words):
# count_vect.fit(X_train)
# X_train_counts = count_vect.transform(X_train)

# Or do both steps in a single call:
# fit the vocabulary and transform the raw messages into count vectors
X_train_counts = count_vect.fit_transform(X_train)

In [28]:
X_train_counts

<3900x7263 sparse matrix of type '<class 'numpy.int64'>'
	with 52150 stored elements in Compressed Sparse Row format>

In [29]:
X_train.shape

(3900,)

As the outputs above show, the training set contains 3900 messages, and the vectorizer found **7263** unique words in them.

In [30]:
X_train_counts.shape

(3900, 7263)
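The 3900 rows follow directly from the 70/30 split of the 5572 messages (4825 ham plus 747 spam). A quick sketch to confirm the arithmetic, using a stand-in array of row indices instead of the real messages (an assumption made so the snippet runs without the TSV file):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 5572 messages; only the number of rows matters here
rows = np.arange(4825 + 747)

train_rows, test_rows = train_test_split(rows, test_size=0.3, random_state=42)
print(len(train_rows), len(test_rows))   # 3900 1672
```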

**Now we'll transform the raw counts into TF-IDF-weighted frequencies**

Read more about [TF-IDF (Term Frequency - Inverse Document Frequency)](https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/) **Very Useful**

In [31]:
from sklearn.feature_extraction.text import TfidfTransformer

In [32]:
tfidf_transformer = TfidfTransformer()

In [33]:
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [34]:
X_train_tfidf.shape

(3900, 7263)

**For convenience, `TfidfVectorizer` combines the count vectorization and the TF-IDF transformation into a single step**

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
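To check that the one-step and two-step routes really give the same result, they can be compared on a small example. The three `sample_texts` below are made up for illustration and are not taken from the SMS dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

# Hypothetical mini-corpus, used only for this comparison
sample_texts = ["free entry to win a prize",
                "are you coming home today",
                "win a free phone today"]

# Two steps: count the words, then apply the TF-IDF weighting
word_counts = CountVectorizer().fit_transform(sample_texts)
two_step = TfidfTransformer().fit_transform(word_counts)

# One step: TfidfVectorizer does both
one_step = TfidfVectorizer().fit_transform(sample_texts)

print(np.allclose(two_step.toarray(), one_step.toarray()))
```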

In [36]:
from sklearn.svm import LinearSVC #linear Support Vector Classifier

clf = LinearSVC()
clf.fit(X_train_tfidf, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

So far only the training set has been vectorized; the test set needs exactly the same transformation before we can predict on it. Repeating these steps by hand for the test set is tedious, but scikit-learn provides a tool that handles both vectorization and classification.

So, we can combine everything into a single Pipeline object.
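For comparison, the manual route could look roughly like the sketch below. The important detail is that the test data goes through `transform()`, not `fit_transform()`, so that it is mapped onto the vocabulary learned from the training data. The toy texts and labels are hypothetical stand-ins for `X_train`, `y_train` and `X_test`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the real train/test split
train_texts = ["win a free prize now", "free entry to win cash",
               "are you coming home", "see you at lunch today"]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["free cash prize", "lunch at home today"]

# Learn the vocabulary and IDF weights from the training data only
vectorizer = TfidfVectorizer()
train_vec = vectorizer.fit_transform(train_texts)

clf = LinearSVC()
clf.fit(train_vec, train_labels)

# Reuse the fitted vectorizer on the test data: transform(), not fit_transform()
test_vec = vectorizer.transform(test_texts)
print(test_vec.shape[1] == train_vec.shape[1])   # same feature columns

manual_predictions = clf.predict(test_vec)
print(manual_predictions)
```

Calling `fit_transform()` on the test set instead would build a different vocabulary, and the classifier's columns would no longer line up with the features it was trained on.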


In [37]:
from sklearn.pipeline import Pipeline

In [38]:
text_clf = Pipeline([('tfidf',TfidfVectorizer()),('clf', LinearSVC())])

Here we built a pipeline from a list of tuples. Each tuple first names a step (for example, `'tfidf'`) and then gives the object that performs that step (`TfidfVectorizer()`); the next tuple defines the next step in the pipeline.

The pipeline above behaves exactly like the text classifier we coded step by step above.

**Pipeline is a really convenient way to perform many steps at once**
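The step names are not just labels: once the pipeline is fitted, each step can be looked up by its name through `named_steps`. A small sketch on hypothetical toy data (the texts and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, used only to get a fitted pipeline
toy_texts = ["win a free prize now", "free entry to win cash",
             "are you coming home", "see you at lunch today"]
toy_labels = ["spam", "spam", "ham", "ham"]

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
pipe.fit(toy_texts, toy_labels)

# Look up the fitted steps by the names given in the tuples
tfidf_step = pipe.named_steps['tfidf']
clf_step = pipe.named_steps['clf']

# The classifier has one weight per word in the learned vocabulary
print(len(tfidf_step.vocabulary_))
print(clf_step.coef_.shape)
```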

In [39]:
text_clf.fit(X_train,y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
               

In [40]:
# The pipeline accepts the raw test messages directly: it converts them into
# vectors with the fitted TfidfVectorizer and then predicts the labels.
predictions = text_clf.predict(X_test)

In [41]:
from sklearn.metrics import confusion_matrix, classification_report

In [42]:
print(confusion_matrix(y_test, predictions))

[[1445    3]
 [  10  214]]


In [43]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      1.00      1448
        spam       0.99      0.96      0.97       224

    accuracy                           0.99      1672
   macro avg       0.99      0.98      0.98      1672
weighted avg       0.99      0.99      0.99      1672



In [44]:
from sklearn import metrics

metrics.accuracy_score(y_test, predictions)

0.9922248803827751
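The figures in the reports above can be recomputed by hand from the confusion matrix, which helps when reading it. The sketch below assumes the usual scikit-learn layout: rows are the true labels and columns the predicted labels, in sorted label order (ham, then spam).

```python
import numpy as np

# Confusion matrix printed above: rows = true label, columns = predicted label
cm = np.array([[1445,   3],    # true ham:  1445 predicted ham, 3 predicted spam
               [  10, 214]])   # true spam: 10 predicted ham, 214 predicted spam

# Accuracy: correct predictions (the diagonal) over all test messages
accuracy = np.trace(cm) / cm.sum()

# Spam precision: of everything predicted spam, how much really was spam?
spam_precision = cm[1, 1] / cm[:, 1].sum()

# Spam recall: of all true spam, how much did we catch?
spam_recall = cm[1, 1] / cm[1, :].sum()

print(round(accuracy, 4), round(spam_precision, 2), round(spam_recall, 2))
# 0.9922 0.99 0.96
```

The 10 spam messages that slipped through as ham are what pull the spam recall down to 0.96.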

##### Let's predict a new message

In [45]:
text_clf.predict(["AXA is the largest insurance company in the world"])

array(['ham'], dtype=object)

In [47]:
text_clf.predict(["Congratulations, You've won 4million dollors. You've been selected as winner amongst 5000 people. TEXT WON to 404040."])

array(['spam'], dtype=object)
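Besides the predicted label, it can be useful to see how confident the classifier is. `LinearSVC` exposes a `decision_function`, and the pipeline forwards it, so raw text can be scored directly. The sketch below uses hypothetical toy data rather than the SMS dataset, so the actual scores will differ from what the model above would give.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, used only to get a fitted pipeline
toy_texts = ["win a free prize now", "free entry to win cash",
             "are you coming home", "see you at lunch today"]
toy_labels = ["spam", "spam", "ham", "ham"]

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
pipe.fit(toy_texts, toy_labels)

new_message = ["win a free prize"]

# Signed distance from the separating hyperplane, one score per message
scores = pipe.decision_function(new_message)
label = pipe.predict(new_message)

# With two classes, a positive score selects classes_[1], otherwise classes_[0]
classes = pipe.named_steps['clf'].classes_
print(scores, label, classes)
```

A score close to zero means the message sits near the decision boundary, so the prediction deserves less trust than one with a large absolute score.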