### Text Classification With Machine Learning and SpaCy
+ Text categorization (text classification) is the task of assigning predefined categories to documents.
+ Sentiment analysis
+ Multilabel classification
+ Dataset source: http://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

##### Aim: classify each review as positive or negative


In [1]:
# Load EDA packages
import pandas as pd

In [2]:
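# Load our datasets (tab-separated: sentence, then a 0/1 label).
# Note: the files have no header row, so with the default header=0 the first
# review in each file is consumed as column names; header=None would keep it.
df_yelp = pd.read_table('yelp_labelled.txt')
df_imdb = pd.read_table('imdb_labelled.txt')
df_amz = pd.read_table('amazon_cells_labelled.txt')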
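
A minimal sketch of how the header setting changes what `read_table` returns, assuming the `sentence<TAB>label` layout with no header row. The two sample lines are made up for illustration and are read from memory rather than from the real files.

```python
import io
import pandas as pd

# Two made-up lines in the "sentence<TAB>label" layout, with no header row
sample = "Wow, loved this place.\t1\nCrust is not good.\t0\n"

# Default behaviour: the first line is consumed as the header
with_default = pd.read_table(io.StringIO(sample))

# header=None keeps every line as data; names= sets the column labels
with_names = pd.read_table(io.StringIO(sample), header=None,
                           names=["Message", "Target"])

print(len(with_default), len(with_names))
```

With `names=` supplied at load time, a separate renaming step would no longer be needed.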


In [3]:
# Collect the three DataFrames in a list
frames = [df_yelp, df_imdb, df_amz]

In [4]:
# Rename the column headers of each DataFrame
for frame in frames:
    frame.columns = ["Message", "Target"]

In [5]:
# Confirm the column names
for frame in frames:
    print(frame.columns)

Index(['Message', 'Target'], dtype='object')
Index(['Message', 'Target'], dtype='object')
Index(['Message', 'Target'], dtype='object')


In [6]:
# Assign a key to each dataset so every row can be traced back to its source
keys = ['Yelp', 'IMDB', 'Amazon']

In [7]:
# Concatenate the datasets, labelling each one with its key
df = pd.concat(frames, keys=keys)
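
Passing `keys` to `pd.concat` adds an outer index level, which makes it possible to pull out one source again later. A small sketch with toy frames (the rows are made up and stand in for the real review files):

```python
import pandas as pd

# Toy frames standing in for two of the review sources
a = pd.DataFrame({"Message": ["good food"], "Target": [1]})
b = pd.DataFrame({"Message": ["dull plot", "great cast"], "Target": [0, 1]})

# keys= adds an outer index level naming the source of each row
combined = pd.concat([a, b], keys=["Yelp", "IMDB"])

# Select one source by its key, or count rows per source
imdb_only = combined.loc["IMDB"]
rows_per_source = combined.groupby(level=0).size()
print(rows_per_source)
```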

In [8]:
# Shape: (rows, columns)
df.shape

(2745, 2)

In [9]:
df.head()

                                                  Message  Target
Yelp 0                                 Crust is not good.       0
     1          Not tasty and the texture was just nasty.       0
     2  Stopped by during the late May bank holiday of...       1
     3  The selection on the menu was great and so wer...       1
     4     Now I am getting angry and I want my damn pho.       0


In [10]:
df.to_csv("sentimentdataset.csv")

In [11]:
# Data Cleaning
df.columns

Index(['Message', 'Target'], dtype='object')

In [12]:
# Checking for Missing Values
df.isnull().sum()

Message    0
Target     0
dtype: int64
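
The real data has no gaps, so the check above prints zeros. As an illustration of what it would report otherwise, here is a sketch on a toy frame with one missing message (all values are made up), followed by a quick look at the class balance:

```python
import pandas as pd

# Toy frame with one missing message (None) for illustration
toy = pd.DataFrame({"Message": ["fine", None, "bad"], "Target": [1, 0, 0]})

missing_per_column = toy.isnull().sum()          # count of missing values per column
cleaned = toy.dropna(subset=["Message"])         # drop rows lacking a message
label_counts = cleaned["Target"].value_counts()  # class balance after cleaning

print(missing_per_column)
print(label_counts)
```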

### Working with spaCy
+ Removing Stopwords
+ Lemmatizing

In [13]:
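import spacy
from spacy.lang.en.stop_words import STOP_WORDS
# The 'en' shortcut only resolves in spaCy 2.x; newer releases need the full
# package name, e.g. spacy.load('en_core_web_sm')
nlp = spacy.load('en')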

In [14]:
# Build a list of stop words to filter on
stopwords = list(STOP_WORDS)

In [15]:
stopwords

['because',
 'around',
 'whither',
 'before',
 'along',
 'must',
 'fifteen',
 'down',
 'first',
 'when',
 'about',
 'once',
 'together',
 'though',
 'did',
 'more',
 'put',
 'are',
 'many',
 'where',
 'bottom',
 'or',
 'hereafter',
 'its',
 'that',
 'does',
 'give',
 'empty',
 'due',
 'almost',
 'behind',
 'becomes',
 'becoming',
 'mostly',
 'name',
 'nothing',
 'our',
 'us',
 'using',
 'wherein',
 'same',
 'between',
 'latterly',
 'own',
 'of',
 'eleven',
 'namely',
 'sixty',
 'four',
 'thus',
 'toward',
 'seemed',
 'up',
 'former',
 'so',
 'which',
 'everyone',
 'off',
 'everything',
 'two',
 'in',
 'if',
 'further',
 'here',
 'will',
 'either',
 'done',
 'beforehand',
 'somewhere',
 're',
 'it',
 'much',
 'at',
 'neither',
 'nobody',
 'formerly',
 'how',
 'since',
 'have',
 'the',
 'whom',
 'elsewhere',
 'mine',
 'across',
 'with',
 'forty',
 'he',
 'nevertheless',
 'this',
 'is',
 'throughout',
 'can',
 'on',
 'moreover',
 'amongst',
 'per',
 'any',
 'twenty',
 'alone',
 'all',
 ...

##### Getting Lemmas and Stop Words

In [16]:
docx = nlp("This is how John Walker was walking. He was also running beside the lawn.")

In [17]:
# Print the lemma of each token
for word in docx:
    print(word.text, "Lemma =>", word.lemma_)
    

This Lemma => this
is Lemma => be
how Lemma => how
John Lemma => john
Walker Lemma => walker
was Lemma => be
walking Lemma => walk
. Lemma => .
He Lemma => -PRON-
was Lemma => be
also Lemma => also
running Lemma => run
beside Lemma => beside
the Lemma => the
lawn Lemma => lawn
. Lemma => .


In [18]:
# Lemmas of tokens that are not pronouns
for word in docx:
    if word.lemma_ != "-PRON-":
        print(word.lemma_.lower().strip())

this
be
how
john
walker
be
walk
.
be
also
run
beside
the
lawn
.


In [19]:
# The same as a list comprehension (pronouns keep their lowercased text)
[word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in docx]

['this',
 'be',
 'how',
 'john',
 'walker',
 'be',
 'walk',
 '.',
 'he',
 'be',
 'also',
 'run',
 'beside',
 'the',
 'lawn',
 '.']

In [20]:
# Filtering out stop words and punctuation
for word in docx:
    if not word.is_stop and not word.is_punct:
        print(word)

This
John
Walker
walking
He
running
lawn


In [21]:
# Stop-word and punctuation filtering as a list comprehension
[word for word in docx if not word.is_stop and not word.is_punct]

[This, John, Walker, walking, He, running, lawn]

In [22]:
# Use the punctuation characters from the string module
import string
punctuations = string.punctuation

In [23]:
# Creating a spaCy parser
from spacy.lang.en import English
parser = English()

In [24]:
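def spacy_tokenizer(sentence):
    """Tokenize a sentence into lowercased lemmas, dropping stop words and punctuation."""
    mytokens = parser(sentence)
    # Use the lemma, except for pronouns ("-PRON-"), which keep their lowercased text
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens]
    # Drop stop words and punctuation tokens
    mytokens = [word for word in mytokens if word not in stopwords and word not in punctuations]
    return mytokens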
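
One detail of the filter above is worth seeing in isolation: `punctuations` is a plain string, so `word not in punctuations` is a substring test rather than a per-character test. The sketch below compares it with a character-wise check; the token list and the helper `is_punct_only` are made up for illustration and are not part of the notebook.

```python
import string

punctuations = string.punctuation       # a str, as in the tokenizer above
punct_chars = set(string.punctuation)   # set of single punctuation characters

tokens = ["good", "!", "...", "", "food"]  # made-up token list

# Substring test: "!" and the empty string are dropped, but "..." survives
by_substring = [t for t in tokens if t not in punctuations]

def is_punct_only(token):
    """True for empty tokens and tokens made only of punctuation characters."""
    return all(ch in punct_chars for ch in token)

# Character-wise test: drops "!", "..." and the empty string
by_chars = [t for t in tokens if not is_punct_only(t)]

print(by_substring)
print(by_chars)
```

Whether multi-character punctuation tokens matter depends on how the tokenizer splits the text, so this is a possible refinement rather than a required fix.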

#### Machine Learning With scikit-learn

In [25]:
# ML Packages
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.metrics import accuracy_score 
from sklearn.base import TransformerMixin 
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

In [26]:
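# Custom transformer that cleans raw text before vectorization
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Apply clean_text to every document
        return [clean_text(text) for text in X]
    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; return self so the step can sit in a Pipeline
        return self
    def get_params(self, deep=True):
        # No hyperparameters to expose
        return {}

# Basic function to clean the text: trim whitespace and lowercase
def clean_text(text):
    return text.strip().lower()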
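
An alternative way to write such a step, sketched here under the assumption that no hyperparameters are needed, is to inherit from `BaseEstimator` as well, which supplies `get_params` and `set_params` automatically. The class name `TextCleaner` is hypothetical and not part of the original notebook.

```python
from sklearn.base import BaseEstimator, TransformerMixin

class TextCleaner(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for `predictors`: strips and lowercases text."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained
        return self

    def transform(self, X):
        # Same cleaning as clean_text, applied to every document
        return [text.strip().lower() for text in X]

cleaned = TextCleaner().fit_transform(["  Great Pork Sandwich.  "])
print(cleaned)
```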

In [27]:
# Bag-of-words vectorizer (unigram counts) and the classifier
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1, 1))
classifier = LinearSVC()

In [28]:
# TF-IDF vectorizer using the same tokenizer
tfvectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)
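
To make the difference between the two vectorizers concrete, here is a sketch on a two-document toy corpus (the sentences are made up, and the default tokenizer is used instead of `spacy_tokenizer` to keep it self-contained). `CountVectorizer` yields raw term counts, while `TfidfVectorizer` reweights them and, by default, scales each row to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["good food good service", "bad food"]  # made-up reviews

counts = CountVectorizer().fit_transform(corpus)  # raw term counts
tfidf = TfidfVectorizer().fit_transform(corpus)   # reweighted, L2-normalised rows

# Columns follow the sorted vocabulary: bad, food, good, service
print(counts.toarray())
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(row_norms)
```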

In [29]:
# Splitting Data Set
from sklearn.model_selection import train_test_split

In [30]:
# Features and Labels
X = df['Message']
ylabels = df['Target']

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.2, random_state=42)
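
The split above is purely random. If the class proportions should be preserved in both parts, `train_test_split` also accepts a `stratify` argument; a minimal sketch on made-up balanced data:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

X = [f"review {i}" for i in range(100)]  # made-up texts
y = [0] * 50 + [1] * 50                  # balanced toy labels

# stratify=y keeps the 50/50 label ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(len(X_tr), len(X_te), Counter(y_te))
```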

In [32]:
# Create the pipeline to clean, tokenize, vectorize, and classify
pipe = Pipeline([('cleaner', predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', classifier)])

In [33]:
# Fit our data
pipe.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('cleaner', <__main__.predictors object at 0x7fee6cac3f98>), ('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ng...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [34]:
# Predicting with a test dataset
sample_prediction = pipe.predict(X_test)

In [36]:
# Prediction Results
# 1 = Positive review
# 0 = Negative review
for (sample,pred) in zip(X_test,sample_prediction):
    print(sample,"Prediction=>",pred)

Great pork sandwich. Prediction=> 1
It is a true classic.   Prediction=> 0
It's close to my house, it's low-key, non-fancy, affordable prices, good food. Prediction=> 0
Audio Quality is poor, very poor. Prediction=> 0
We loved the biscuits!!! Prediction=> 1
I don't have very many words to say about this place, but it does everything pretty well. Prediction=> 0
Was not happy. Prediction=> 1
The headsets are easy to use and everyone loves them. Prediction=> 1
I miss it and wish they had one in Philadelphia! Prediction=> 0
Still it's quite interesting and entertaining to follow.   Prediction=> 1
All three broke within two months of use. Prediction=> 0
Oh yeah, and the storyline was pathetic too.   Prediction=> 0
IT'S REALLY EASY. Prediction=> 1
Every element of this story was so over the top, excessively phony and contrived that it was painful to sit through.   Prediction=> 0
The food was outstanding and the prices were very reasonable. Prediction=> 1
I am so tired of clichés that is just

In [37]:
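# Accuracy on the held-out test set
print("Accuracy: ",pipe.score(X_test,y_test))
# Scoring the predictions against themselves always gives 1.0, so this line says nothing about the model
print("Accuracy: ",pipe.score(X_test,sample_prediction))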

Accuracy:  0.7941712204007286
Accuracy:  1.0
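
For a classifier pipeline, `score` reports mean accuracy, which is the same quantity `accuracy_score` computes from true and predicted labels. A sketch with made-up labels shows both the meaningful comparison and why the second figure above is always 1.0:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1]  # made-up gold labels
y_pred = [1, 0, 0, 1]  # made-up predictions

acc = accuracy_score(y_true, y_pred)       # 3 of 4 predictions are correct
self_acc = accuracy_score(y_pred, y_pred)  # predictions vs. themselves

print(acc, self_acc)
```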


In [38]:
# Accuracy on the training set; a value well above the test score suggests overfitting
print("Accuracy: ",pipe.score(X_train,y_train))


Accuracy:  0.9849726775956285


In [39]:
# Predict on a new, unseen review
pipe.predict(["This was a great movie"])

array([1])

In [40]:
example = ["I do enjoy my job",
 "What a poor product!,I will have to get a new one",
 "I feel amazing!"]
       

In [41]:
pipe.predict(example)

array([1, 0, 1])
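
`predict` returns only the class label. To see how far each input lies from the decision boundary, a pipeline ending in `LinearSVC` also exposes `decision_function`. The sketch below trains on a tiny made-up corpus with the default tokenizer, so its numbers will not match the notebook's model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up training set: 1 = positive, 0 = negative
texts = ["great movie", "great food", "poor product", "poor service"]
labels = [1, 1, 0, 0]

toy_pipe = Pipeline([("vectorizer", CountVectorizer()),
                     ("classifier", LinearSVC())])
toy_pipe.fit(texts, labels)

new_texts = ["great service", "poor movie"]
scores = toy_pipe.decision_function(new_texts)  # signed score per input
preds = toy_pipe.predict(new_texts)             # label 1 where the score is positive

print(scores, preds)
```

Scores close to zero mark inputs the model is least sure about.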

#### Using TF-IDF

In [42]:
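# Create the pipeline to clean, tokenize, vectorize (TF-IDF), and classify
# A fresh LinearSVC is used so that fitting this pipeline does not refit the classifier inside `pipe`
pipe_tfid = Pipeline([("cleaner", predictors()),
                 ('vectorizer', tfvectorizer),
                 ('classifier', LinearSVC())])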
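
A `Pipeline` stores the estimator objects it is given rather than copying them, so two pipelines built around one classifier instance share its fitted state. The sketch below shows the sharing and one way to avoid it with `sklearn.base.clone`; the names `shared`, `pipe_a`, `pipe_b` and `pipe_c` are made up for illustration.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

shared = LinearSVC()
pipe_a = Pipeline([("vectorizer", CountVectorizer()), ("classifier", shared)])
pipe_b = Pipeline([("vectorizer", TfidfVectorizer()), ("classifier", shared)])

# Both pipelines hold the very same object, so fitting one refits the other's model
same_object = pipe_a.named_steps["classifier"] is pipe_b.named_steps["classifier"]

# clone() returns an unfitted copy with identical hyperparameters
pipe_c = Pipeline([("vectorizer", TfidfVectorizer()),
                   ("classifier", clone(shared))])
independent = pipe_c.named_steps["classifier"] is not shared

print(same_object, independent)
```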

In [43]:
pipe_tfid.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('cleaner', <__main__.predictors object at 0x7fee6c98d550>), ('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ng...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [44]:
sample_prediction1 = pipe_tfid.predict(X_test)

In [45]:
for (sample,pred) in zip(X_test,sample_prediction1):
    print(sample,"Prediction=>", pred)

Great pork sandwich. Prediction=> 1
It is a true classic.   Prediction=> 1
It's close to my house, it's low-key, non-fancy, affordable prices, good food. Prediction=> 0
Audio Quality is poor, very poor. Prediction=> 0
We loved the biscuits!!! Prediction=> 1
I don't have very many words to say about this place, but it does everything pretty well. Prediction=> 0
Was not happy. Prediction=> 1
The headsets are easy to use and everyone loves them. Prediction=> 1
I miss it and wish they had one in Philadelphia! Prediction=> 0
Still it's quite interesting and entertaining to follow.   Prediction=> 1
All three broke within two months of use. Prediction=> 0
Oh yeah, and the storyline was pathetic too.   Prediction=> 0
IT'S REALLY EASY. Prediction=> 1
Every element of this story was so over the top, excessively phony and contrived that it was painful to sit through.   Prediction=> 0
The food was outstanding and the prices were very reasonable. Prediction=> 1
I am so tired of clichés that is just

In [46]:
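# Accuracy on the held-out test set
print("Accuracy: ",pipe_tfid.score(X_test,y_test))
# Scoring the predictions against themselves always gives 1.0, so this line says nothing about the model
print("Accuracy: ",pipe_tfid.score(X_test,sample_prediction1))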

Accuracy:  0.7978142076502732
Accuracy:  1.0


In [47]:
### Jesse JCharis
### J-Secur1ty
### Jesus Saves @ JCharisTech