# Text Classification with spaCy

A common task in NLP is **text classification**: assigning one of a fixed set of labels to a piece of text. This is "classification" in the conventional machine learning sense, applied to text. Examples include spam detection, sentiment analysis, and tagging customer queries.

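In this tutorial, you'll learn text classification with spaCy. The classifier will detect spam messages, a feature of most email clients. The first cell below is a warm-up that walks through basic preprocessing with NLTK on a sample sentence; the cells after it load and preview the spam data you'll use: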

In [None]:
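import nltk
from nltk.tokenize import word_tokenize

# Tokenizer models used by word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

sentence = '''
Hello I am Aayush living in canada
but mentally I am wandering in the streets of India
eating american food
living indian life
drinking canadian water'''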

# Tokenization: split the sentence into word tokens
tokens = word_tokenize(sentence)
print(tokens)

# Lowercase every token
lowercase_word = [word.lower() for word in tokens]
print(lowercase_word)

# Stemming with the Porter stemmer
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
for word in lowercase_word:
    stem = ps.stem(word)  # separate name so the token list above is not overwritten
    print(word, ":", stem)

import nltk
nltk.download('wordnet')  # WordNet data, needed for lemmatization (the stemmer does not use it)

# Lemmatization with WordNet
from nltk.stem import WordNetLemmatizer
wml = WordNetLemmatizer()
lemma = []
for word in lowercase_word:
    lemmatized = wml.lemmatize(word)
    lemma.append(lemmatized)
    print(word, ":", lemmatized)

# Remove stop words and keep only the informative ("good") words
from nltk.corpus import stopwords
nltk.download('stopwords')  # stop-word lists used below
stop_words = set(stopwords.words('english'))
filter_words = [word for word in lemma if word not in stop_words]

filter_words

# Exercise: repeat the same steps for the spam data (filter by the label "ham").
# Try cleaning the data and look for evident patterns, e.g. strip "www." links, numbers, and non-English words.
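
The cleaning suggested in the comments above can be sketched with the standard `re` module. This is a minimal sketch, not a reference solution: the function name `clean_message`, the two regular expressions, and the sample string are all assumptions made for illustration, and "non-English" is approximated here by dropping every character outside the ASCII letters a-z.

```python
import re

# Hypothetical helper, not part of the original notebook.
URL_RE = re.compile(r"(https?://\S+|www\.\S+)")  # links starting with http(s):// or www.
NON_LETTER_RE = re.compile(r"[^a-z\s]")          # digits, punctuation, non-ASCII characters

def clean_message(text):
    """Lowercase a message, drop links, then drop everything except a-z and whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return text.split()  # plain whitespace split keeps the sketch dependency-free

# Made-up example message, for illustration only
sample = "WINNER!! Claim your prize at www.example.com or call 0800123456 now"
print(clean_message(sample))
```

The links are removed before the other characters so that the dots and slashes inside a URL do not leave stray fragments such as `www` or `com` behind. In the notebook itself, `word_tokenize` could replace the final `split()` to stay consistent with the steps above.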

In [None]:
import pandas as pd

# Loading the spam data
# ham is the label for non-spam messages
spam = pd.read_csv('../input/nlp-course/spam.csv')
spam.head(10)

In [None]:
spam['text'].tolist()[:20]
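
To follow up on the earlier note about filtering by the label "ham", one option is a boolean mask on the label column. The sketch below is hedged in two ways: it assumes the label column is named `label` (only the `text` column is confirmed by the cells above), and it uses a tiny made-up DataFrame in place of `spam.csv` so that it runs on its own.

```python
import pandas as pd

# Made-up stand-in for spam.csv; the 'label' column name is an assumption
demo = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "text": [
        "Are we still meeting for lunch today?",
        "Free entry! Text WIN to claim your prize now",
        "I'll call you later tonight",
    ],
})

# Boolean mask: keep only the non-spam ("ham") rows
ham = demo[demo["label"] == "ham"]
ham_texts = ham["text"].tolist()

# Class balance is worth checking before training a classifier
label_counts = demo["label"].value_counts()

print(ham_texts)
print(label_counts)
```

With the real data, the same mask would be applied to `spam` instead of `demo`, after checking the actual column names with `spam.columns`.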