Most classic machine learning algorithms can’t take in raw text. Instead, we need to perform **feature extraction** on the raw text so that numerical features can be passed to the machine learning algorithm.
The steps involved in feature extraction are given below:



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Set the working directory and import the necessary libraries:**

In [None]:
import os
os.chdir("/content/drive/My Drive/Github/")

In [None]:
import numpy as np
import pandas as pd

**Load the dataset:**

In [None]:
df = pd.read_csv('smsspamcollection.tsv', sep='\t')
df.head()

   label                                            message  length  punct
0    ham  Go until jurong point, crazy.. Available only ...     111      9
1    ham                      Ok lar... Joking wif u oni...      29      6
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...     155      6
3    ham  U dun say so early hor... U c already then say...      49      6
4    ham  Nah I don't think he goes to usf, he lives aro...      61      2


**Check for missing values:**

In [None]:
df.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

**Take a quick look at the ham and spam label column:**

In [None]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

**Split the data into train & test sets:**

In [None]:
from sklearn.model_selection import train_test_split

X = df['message']  # this time we want to look at the text
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
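
As a quick sanity check on the split sizes, the sketch below repeats the same call on placeholder data. The arrays dummy_X and dummy_y are hypothetical stand-ins built only from the label counts shown above (4825 ham, 747 spam), not the real messages. It is meant to illustrate how test_size=0.33 divides 5572 rows, which is where the 3733 training rows reported by the vectorizers come from.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical placeholders with the same number of rows as the dataset
n_messages = 4825 + 747                      # ham + spam counts from value_counts()
dummy_X = np.arange(n_messages)              # stands in for df['message']
dummy_y = np.array(["ham"] * 4825 + ["spam"] * 747)

train_X, test_X, train_y, test_y = train_test_split(
    dummy_X, dummy_y, test_size=0.33, random_state=42
)

# The test share is rounded up; the training set receives the remainder
print(len(train_X), len(test_X))
```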

**Scikit-learn's CountVectorizer:**

Text preprocessing, tokenizing and the ability to filter out stopwords are all included in CountVectorizer, which builds a dictionary of features and transforms documents to feature vectors.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(3733, 7082)
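
To make the output shape easier to interpret, here is a minimal sketch on a hypothetical three-document corpus (toy_docs is made up for illustration and is not part of the SMS dataset). Each row of the result is a document and each column is a vocabulary term. With the default settings, tokens shorter than two characters, such as "a", are not added to the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the SMS dataset
toy_docs = [
    "free entry to win a prize",
    "call me when you are free",
    "win win win",
]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)   # sparse matrix: documents x terms

# vocabulary_ maps each term to its column index; sort terms by column
toy_terms = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(toy_terms)

# A dense view is only practical for tiny examples like this one
print(toy_counts.toarray())
```

The full training matrix works the same way: 3733 rows for the training messages and 7082 columns for the distinct terms found in them.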

**Transform Counts to Frequencies with Tf-idf:**

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(3733, 7082)
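
The shape is unchanged because tf-idf only reweights the counts. The sketch below, on a small hypothetical count matrix, reproduces by hand what TfidfTransformer does with its default settings as I understand them: a smoothed idf of ln((1 + n) / (1 + df)) + 1, multiplied by the raw counts, followed by L2 normalisation of each row. The matrix toy_count_matrix and the other variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents x 3 terms
toy_count_matrix = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 1, 1],
])

n_docs = toy_count_matrix.shape[0]
doc_freq = (toy_count_matrix > 0).sum(axis=0)        # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1      # smoothed idf (default settings)

manual_tfidf = toy_count_matrix * idf                # term frequency times idf
manual_tfidf = manual_tfidf / np.linalg.norm(manual_tfidf, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(toy_count_matrix).toarray()

# Compare the hand calculation with the library result
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Terms that appear in many documents get a lower idf, so they contribute less to each row than rarer terms with the same count.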

**Combine Steps with TfidfVectorizer:**

The CountVectorizer and TfidfTransformer steps can be combined into one using TfidfVectorizer.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train) # remember to use the original X_train set
X_train_tfidf.shape

(3733, 7082)
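
As a rough check that the one-step and two-step routes agree, the sketch below compares them on hypothetical stand-ins for X_train and X_test (toy_train and toy_test are made up for illustration). It also shows how unseen text would typically be handled: transform is called on the already fitted vectorizer, so the vocabulary and idf weights learned from the training data are reused rather than refitted.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical stand-ins for X_train and X_test
toy_train = [
    "free entry to win a prize",
    "call me when you are free",
    "ok see you later",
]
toy_test = ["win a free prize now"]

# Two-step route: counts first, then tf-idf weighting
two_step_counts = CountVectorizer().fit_transform(toy_train)
two_step = TfidfTransformer().fit_transform(two_step_counts)

# One-step route on the original text
toy_vectorizer = TfidfVectorizer()
one_step = toy_vectorizer.fit_transform(toy_train)

print(np.allclose(two_step.toarray(), one_step.toarray()))

# New text: transform only, never fit_transform, so the columns stay aligned
toy_test_tfidf = toy_vectorizer.transform(toy_test)
print(toy_test_tfidf.shape)
```

Words that were not seen during fitting, such as "now" in this toy test message, are simply ignored at transform time.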