Most classical machine learning algorithms can't take in raw text. Instead, we need to extract numerical features from the raw text and pass those features to the algorithm.

In [45]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [63]:
df = pd.read_csv('smsspamcollection.tsv', sep='\t')

In [64]:
df.head()

   label                                            message  length  punct
0    ham  Go until jurong point, crazy.. Available only ...     111      9
1    ham                      Ok lar... Joking wif u oni...      29      6
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...     155      6
3    ham  U dun say so early hor... U c already then say...      49      6
4    ham  Nah I don't think he goes to usf, he lives aro...      61      2


In [65]:
df.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

In [66]:
len(df)

5572

The dataset has 5572 rows and no missing values.

In [50]:
df['label'].unique()

array(['ham', 'spam'], dtype=object)

In [51]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

TEXT FEATURE EXTRACTION

In [52]:
X = df['message']
y = df['label']

In [53]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Next we apply count vectorization. scikit-learn's CountVectorizer handles text preprocessing, tokenizing, and optional stop-word filtering in one step. It builds a dictionary of features (the vocabulary) and transforms documents into feature vectors of token counts.
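To make this concrete, here is a minimal sketch on a made-up three-message corpus. The messages and the names `toy_docs`, `toy_vect`, and `toy_counts` are illustrative assumptions, not part of the SMS dataset. It uses CountVectorizer's default settings, which lowercase the text and ignore single-character tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "Free free entry to win",
    "Ok see you later",
    "Win a free prize",
]

toy_vect = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse
# (documents x vocabulary) matrix of token counts
toy_counts = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index in the matrix
vocab = toy_vect.vocabulary_
print(sorted(vocab, key=vocab.get))
print(toy_counts.toarray())
```

Each row is one message and each column is one token, so the repeated word "free" in the first message shows up as a count of 2 in that row.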

In [54]:
from sklearn.feature_extraction.text import CountVectorizer

In [55]:
count_vect = CountVectorizer()

In [62]:
# Fitting builds the vocabulary; transforming maps each message to a count vector.
# The two steps can be run separately:
# count_vect.fit(X_train)
# X_train_counts = count_vect.transform(X_train)
# or in a single call:
X_train_counts = count_vect.fit_transform(X_train)

In [22]:
X_train_counts

<3733x7082 sparse matrix of type '<class 'numpy.int64'>'
	with 49992 stored elements in Compressed Sparse Row format>

In [23]:
X_train.shape

(3733,)

X_train holds 3733 documents, and the count matrix has one column for each of the 7082 unique tokens found in them:

In [24]:
X_train_counts.shape

(3733, 7082)

In [26]:
from sklearn.feature_extraction.text import TfidfTransformer

In [27]:
tfidf_transformer = TfidfTransformer()

In [28]:
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [29]:
X_train_tfidf.shape

(3733, 7082)

The matrix shape is unchanged, but the entries are no longer raw counts: each one is the term frequency multiplied by the term's inverse document frequency (TF-IDF). With scikit-learn's defaults, each row is also L2-normalized.
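As a rough illustration of what the transformer computes, the sketch below reimplements the weighting by hand in plain Python. It assumes the smoothed formula that, as far as I understand, scikit-learn uses by default (`idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The toy corpus, the whitespace tokenization, and all names are assumptions for illustration only.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized naively on whitespace
docs = [
    "free prize free entry",
    "ok see you later",
    "free entry today",
]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Document frequency: in how many documents each token appears
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed inverse document frequency (assumed to match the library default)
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenized document."""
    tf = Counter(tokens)
    raw = [tf[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf = [tfidf_row(doc) for doc in tokenized]
for tok, weight in zip(vocab, tfidf[0]):
    print(f"{tok:>6}: {weight:.3f}")
```

A token that appears in fewer documents (here "prize") gets a larger IDF than one that appears in many (here "free"), which is why TF-IDF downweights words that are common across the whole corpus.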

So far we ran count vectorization first and then the TF-IDF transformation. Both steps can be combined into a single class, TfidfVectorizer.
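A quick way to convince yourself that the two routes agree is to run both on the same text and compare the results. The sketch below does this on a small made-up corpus (the messages and variable names are assumptions for illustration), with default settings for every class.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "Free entry to win a prize",
    "Ok see you later",
    "Win a free prize today",
]

# Route 1: counts first, then TF-IDF weighting
toy_counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(toy_counts)

# Route 2: both steps in one class
one_step = TfidfVectorizer().fit_transform(toy_docs)

# Compare the dense versions of the two sparse matrices
same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, one_step.shape, same)
```

The single-class version is usually more convenient because there is only one object to fit and to reuse on the test set.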

In [67]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [68]:
tfidf_vectorizer = TfidfVectorizer()

In [69]:
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

In [70]:
X_train_tfidf.shape

(3733, 7082)

In [71]:
from sklearn.svm import LinearSVC

In [72]:
clf = LinearSVC()

In [74]:
clf.fit(X_train_tfidf, y_train)

LinearSVC()

In [75]:
from sklearn.pipeline import Pipeline

In [38]:
text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [39]:
text_clf.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [40]:
predictions = text_clf.predict(X_test)

In [41]:
from sklearn import metrics

In [42]:
print(metrics.confusion_matrix(y_test, predictions))

[[1586    7]
 [  12  234]]


In [43]:
print(metrics.classification_report(y_test, predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

    accuracy                           0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839
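The figures in the report can be checked by hand from the confusion matrix printed earlier. The sketch below assumes scikit-learn's usual layout, where rows are true labels and columns are predicted labels in sorted order (ham, then spam), and treats spam as the positive class. The variable names are illustrative.

```python
# Confusion matrix copied from the output above:
# rows = true label, columns = predicted label, order = (ham, spam)
cm = [[1586, 7],
      [12, 234]]

tn, fp = cm[0]  # true ham:  predicted ham, predicted spam
fn, tp = cm[1]  # true spam: predicted ham, predicted spam

# Precision: of the messages flagged as spam, how many really were spam
spam_precision = tp / (tp + fp)
# Recall: of the real spam messages, how many were caught
spam_recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)
# Accuracy: share of all test messages classified correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"spam precision: {spam_precision:.2f}")
print(f"spam recall:    {spam_recall:.2f}")
print(f"spam f1-score:  {spam_f1:.2f}")
print(f"accuracy:       {accuracy:.4f}")
```

The row sums (1593 and 246) match the support column, and the rounded spam precision, recall, and f1-score match the 0.97, 0.95, and 0.96 in the report.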



In [ ]:
print(metrics.accuracy_score(y_test, predictions))

GridSearchCV from sklearn