### Text Classification With Machine Learning and SpaCy
+ Text categorization / text classification is the task of assigning predefined categories to documents.
+ Sentiment Analysis
+ Multilabel classification
+ Dataset source: http://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

##### Aim: classify each review as positive or negative


In [1]:
# Load EDA packages
import pandas as pd

In [7]:
# Load our dataset
url = 'https://raw.githubusercontent.com/microsoft/ML-Server-Python-Samples/master/microsoftml/202/data/sentiment_analysis/imdb_labelled.txt'
df = pd.read_csv(url,sep='\t',names=['text','label'], header=None)



In [8]:
df

Unnamed: 0,text,label
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1
...,...,...
743,I just got bored watching Jessice Lange take h...,0
744,"Unfortunately, any virtue in this film's produ...",0
745,"In a word, it is embarrassing.",0
746,Exceptionally bad!,0


In [9]:
# Checking for Missing Values
df.isnull().sum()

text     0
label    0
dtype: int64

In [11]:
df['label'].unique()

array([0, 1])

In [12]:
def tokenizer(sentence):
    return sentence.split()
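
Splitting on whitespace keeps punctuation attached to the neighbouring word, so `bad!` and `bad` end up as different tokens. A quick check of the same `tokenizer`, using a short review from the dataset preview above:

```python
def tokenizer(sentence):
    # Same whitespace tokenizer as in the cell above
    return sentence.split()

tokens = tokenizer("Exceptionally bad!")
print(tokens)  # ['Exceptionally', 'bad!']
```

A tokenizer that strips punctuation or lemmatizes would merge such variants, at the cost of extra preprocessing.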

#### Machine Learning With SKlearn

In [14]:
# ML Packages
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression


In [16]:
# Vectorization
vectorizer = CountVectorizer(tokenizer = tokenizer)
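
To make the vectorization step concrete, here is a minimal pure-Python sketch of the bag-of-words idea behind `CountVectorizer`. It is an illustration under assumptions, not scikit-learn's implementation: the two-sentence corpus is made up, and the sketch mirrors only the default lowercasing and the alphabetically sorted vocabulary.

```python
from collections import Counter

def tokenizer(sentence):
    # Same whitespace tokenizer as defined earlier
    return sentence.split()

# Toy corpus for illustration only (not taken from the dataset)
docs = ["good movie", "bad movie bad plot"]

# Lowercase first, then tokenize
tokenized = [tokenizer(d.lower()) for d in docs]

# Vocabulary: each unique token gets a column index, in sorted order
unique_tokens = sorted({tok for doc in tokenized for tok in doc})
vocab = {tok: i for i, tok in enumerate(unique_tokens)}

# One row of token counts per document
rows = []
for doc in tokenized:
    counts = Counter(doc)
    rows.append([counts.get(tok, 0) for tok in vocab])

print(vocab)  # {'bad': 0, 'good': 1, 'movie': 2, 'plot': 3}
print(rows)   # [[0, 1, 1, 0], [2, 0, 1, 1]]
```

Each review becomes a fixed-length vector of counts, which is the form the classifier consumes.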


In [18]:
# Splitting Data Set
from sklearn.model_selection import train_test_split

In [20]:
# Features and Labels
X = df['text']
ylabels = df['label']

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.2, random_state=42)

In [22]:
# Create the pipeline to tokenize, vectorize, and classify
classifier = LogisticRegression()
pipe = Pipeline([('vectorizer', vectorizer),
                 ('classifier', classifier)])

In [23]:
# Fit our data
pipe.fit(X_train,y_train)



In [24]:
# Predicting with a test dataset
sample_prediction = pipe.predict(X_test)

In [25]:
# Accuracy on the held-out test split
print("Test Accuracy: ",pipe.score(X_test,y_test))


Test Accuracy:  0.6866666666666666
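
For a classifier, `score` reports mean accuracy: the fraction of predictions that match the true labels. A minimal sketch of that calculation, with made-up labels and a hypothetical `accuracy` helper:

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction equals the true label
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Illustrative labels only (not from the dataset)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

result = accuracy(y_true, y_pred)
print(result)  # 0.8
```

Comparing this value on the test split with the value on the training split gives a rough sense of how well the model generalizes.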


In [36]:
# Accuracy on the training split
print("Train Accuracy: ",pipe.score(X_train,y_train))


Train Accuracy:  0.81438127090301


In [27]:
# Another random review
pipe.predict(["This was a great movie"])

array([1])

In [28]:
example = ["I do enjoy my job",
 "What a poor product!,I will have to get a new one",
 "I feel amazing!"]


In [29]:
pipe.predict(example)

array([1, 1, 1])

#### Using TF-IDF

In [30]:
# Using Tfidf
tfvectorizer = TfidfVectorizer(tokenizer = tokenizer)
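
TF-IDF reweights raw counts so that tokens appearing in many documents contribute less. The sketch below works through the weighting on a toy tokenized corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches scikit-learn's documented defaults as far as I know; the corpus and the `tfidf` helper are made up for illustration.

```python
import math
from collections import Counter

# Toy tokenized corpus for illustration only
docs = [["good", "movie"], ["bad", "movie", "bad", "plot"]]
n_docs = len(docs)

# Document frequency: number of documents containing each token
doc_freq = Counter(tok for doc in docs for tok in set(doc))

# Smoothed inverse document frequency
idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in doc_freq.items()}

def tfidf(doc):
    # Term count times IDF, then scale the vector to unit L2 norm
    counts = Counter(doc)
    weights = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()}

vec = tfidf(docs[0])
print(idf["movie"])  # 1.0, because "movie" appears in every document
print(vec)
```

In the first document, `good` receives a larger weight than `movie` because `movie` occurs in both documents and is therefore less informative.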

In [31]:
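# Create the pipeline to tokenize, vectorize, and classify
# Note: `classifier` is the same LogisticRegression instance used in `pipe`,
# so fitting this pipeline refits it in place. Pass a fresh LogisticRegression()
# here to keep the two pipelines independent.
pipe_tfid = Pipeline([
                 ('vectorizer', tfvectorizer),
                 ('classifier', classifier)])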

In [32]:
pipe_tfid.fit(X_train,y_train)



In [35]:
print("Test Accuracy: ",pipe_tfid.score(X_test,y_test))


Test Accuracy:  0.6933333333333334


In [34]:
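# Accuracy on the training split
# Note: this scores `pipe` (the CountVectorizer pipeline), not `pipe_tfid`;
# use pipe_tfid.score(X_train, y_train) for the TF-IDF training accuracy.
print("Train Accuracy: ",pipe.score(X_train,y_train))

Train Accuracy:  0.81438127090301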


In [None]:
### Jesse JCharis
### J-Secur1ty
### Jesus Saves @ JCharisTech