### NATURAL LANGUAGE PROCESSING
- Load input files and read reviews
- Tokenize
- Stopwords
- Stemming/lemmatization
- Write cleaned data to output file
- Bag of words
- TF-IDF (term frequency-inverse document frequency)
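
The bag-of-words step in the outline can be sketched with the standard library alone. This is a minimal illustration, not the vectorizer used later in the notebook; the two documents are made up for the example.

```python
from collections import Counter

# Two toy documents (hypothetical, not taken from the review datasets)
docs = ["the food was good", "the service was not good"]

# Vocabulary: every distinct word across all documents, in sorted order
vocab = sorted({word for d in docs for word in d.split()})

# One count vector per document, aligned with the vocabulary
vectors = []
for d in docs:
    counts = Counter(d.split())
    vectors.append([counts[word] for word in vocab])

print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes a vector of word counts; word order is discarded, which is why the representation is called a "bag".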

In [1]:
import spacy
from spacy import displacy

In [16]:
!pip3 install -U spacy
!python3 -m spacy download en_core_web_sm

Requirement already up-to-date: spacy in f:\anaconda\lib\site-packages (2.3.2)


Python was not found but can be installed from the Microsoft Store: https://go.microsoft.com/fwlink?linkID=2082640


In [2]:
import en_core_web_sm

In [3]:
nlp = spacy.load('en_core_web_sm')

In [4]:
import spacy
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")

#### Let's get a basic understanding before we move on to the project

In [5]:
text = "Apple, This is first sentence. and Google this is another one. here 3rd one is"

In [6]:
doc = nlp(text)

In [7]:
doc

Apple, This is first sentence. and Google this is another one. here 3rd one is

### Tokenize each word

In [8]:
for token in doc:
    print(token)

Apple
,
This
is
first
sentence
.
and
Google
this
is
another
one
.
here
3rd
one
is


## Get the sentences in the above text
- parser: the component that performs syntactic analysis on a sequence of symbols, whether in natural language or in a computer language.

In [9]:
sent = nlp.create_pipe('sentencizer')

In [10]:
nlp.add_pipe(sent, before='parser')
# Add the sentencizer before the parser so the text is split into sentences first

In [11]:
doc = nlp(text)

In [12]:
doc

Apple, This is first sentence. and Google this is another one. here 3rd one is

In [16]:
for sent in doc.sents:
    print(sent)

Apple, This is first sentence.
and Google this is another one.
here 3rd one is
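
For intuition, punctuation-based sentence splitting can be approximated with a regular expression. This is a rough sketch under the assumption that a sentence ends at `.`, `!` or `?` followed by whitespace; it is not spaCy's implementation and ignores abbreviations and similar edge cases.

```python
import re

text = "Apple, This is first sentence. and Google this is another one. here 3rd one is"

# Split wherever sentence-ending punctuation is followed by whitespace.
# The lookbehind keeps the punctuation attached to the sentence it ends.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

for s in sentences:
    print(s)
```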


### REMOVE STOP_WORDS

In [17]:
from spacy.lang.en.stop_words import STOP_WORDS

In [18]:
stopwords = list(STOP_WORDS)

In [19]:
stopwords

['hereupon',
 'who',
 'have',
 'regarding',
 'by',
 'either',
 '’m',
 'our',
 'since',
 'also',
 'say',
 'us',
 'name',
 'each',
 '‘ve',
 'somewhere',
 'itself',
 'whose',
 'namely',
 'your',
 'part',
 'whom',
 'many',
 'its',
 'nowhere',
 'with',
 'others',
 'it',
 'while',
 'whereafter',
 'mostly',
 'very',
 'one',
 'enough',
 'everything',
 'thereby',
 'had',
 'first',
 'within',
 'beforehand',
 "n't",
 'own',
 'hereby',
 'give',
 'if',
 'call',
 'am',
 'behind',
 'up',
 'over',
 'we',
 'you',
 'until',
 'nine',
 'out',
 'why',
 'but',
 'thru',
 'several',
 'not',
 "'m",
 'so',
 "'ll",
 'her',
 'fifteen',
 'everyone',
 'two',
 'seeming',
 'seems',
 'seemed',
 'although',
 'from',
 'done',
 'too',
 'should',
 'herein',
 'whither',
 'do',
 'just',
 'ourselves',
 'else',
 'before',
 'go',
 'twelve',
 'at',
 'when',
 'serious',
 'might',
 'elsewhere',
 'perhaps',
 'front',
 '’ll',
 '‘d',
 'neither',
 'doing',
 'still',
 'against',
 'here',
 "'re",
 'nor',
 'noone',
 'there',
 'third',
 

In [20]:
len(stopwords)

326

In [21]:
# print only the tokens that are not stop words
for token in doc:
    if not token.is_stop:
        print(token)

Apple
,
sentence
.
Google
.
3rd
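
The same filtering idea works on a plain list of strings. The sketch below assumes a tiny hand-picked stop list (`toy_stopwords` is hypothetical; spaCy's `STOP_WORDS` is much larger) and also drops punctuation, as the cleaning function later in the notebook does.

```python
import string

# Hypothetical mini stop list, chosen only to cover the example sentence
toy_stopwords = {"this", "is", "first", "and", "another", "one", "here"}

tokens = ["Apple", ",", "This", "is", "first", "sentence", ".", "and",
          "Google", "this", "is", "another", "one", ".", "here", "3rd",
          "one", "is"]

# Compare in lowercase so "This" and "this" are treated alike,
# and skip any token that is a single punctuation character
kept = [t for t in tokens
        if t.lower() not in toy_stopwords and t not in string.punctuation]

print(kept)
```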


## LEMMATIZATION

In [22]:
doc = nlp('run runs running runner')


In [23]:
for lem in doc:
    print(lem.text,lem.lemma_)

run run
runs run
running run
runner runner
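
The outline also mentions stemming, which the notebook does not demonstrate. As a contrast with the lemmas above, here is a toy suffix stripper; `toy_stem` and its suffix list are invented for illustration and are far cruder than a real stemmer such as Porter's.

```python
def toy_stem(word):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "er", "s"):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["run", "runs", "running", "runner"]
stems = [toy_stem(w) for w in words]

for w, s in zip(words, stems):
    print(w, s)
```

A stemmer only chops characters, so it can produce non-words such as `runn`, whereas a lemmatizer maps each token to a dictionary form.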


### POS (PARTS OF SPEECH)

In [25]:
doc = nlp('All is well at your end!')

In [28]:
for token in doc:
    print(token.text,token.pos_)

All DET
is AUX
well ADJ
at ADP
your DET
end NOUN
! PUNCT


In [29]:
displacy.render(doc, style = 'dep')

### ENTITY DETECTION

In [31]:
doc = nlp("New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases. At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday. The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.")

In [32]:
doc

New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases. At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday. The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.

In [33]:
displacy.render(doc, style = 'ent')

#### **PROJECT**
#### **TEXT CLASSIFICATION (YELP, IMDB AND AMAZON REVIEW DATASETS)**

In [35]:
import pandas as pd

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [44]:
data_yelp = pd.read_csv('NLP(folders)/yelp_label.txt',sep='\t',header=None)

In [45]:
data_yelp.head()

| | 0 | 1 |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [46]:
column_name = ['Review','Sentiment']
data_yelp.columns = column_name

In [47]:
data_yelp.head()

| | Review | Sentiment |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [48]:
data_yelp.shape

(1000, 2)

In [49]:
data_yelp.columns

Index(['Review', 'Sentiment'], dtype='object')

##### LOAD AMAZON DATASET

In [50]:
data_amazon = pd.read_csv('NLP(folders)/amazon_labeled.txt',sep='\t',header=None)

In [51]:
data_amazon.head()

| | 0 | 1 |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


In [52]:
data_amazon.columns = column_name

In [53]:
data_amazon.head()

| | Review | Sentiment |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


In [54]:
data_amazon.shape

(1000, 2)

##### LOAD IMDB DATASET

In [55]:
data_imdb = pd.read_csv('NLP(folders)/imdb.txt', sep = '\t', header = None)

In [56]:
data_imdb.head()

| | 0 | 1 |
|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
| 3 | Very little music or anything to speak of. | 0 |
| 4 | The best scene in the movie was when Gerardo i... | 1 |


In [57]:
data_imdb.columns = column_name

In [58]:
data_imdb.shape

(748, 2)

In [59]:
data_imdb.head()

| | Review | Sentiment |
|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
| 3 | Very little music or anything to speak of. | 0 |
| 4 | The best scene in the movie was when Gerardo i... | 1 |


In [60]:
data_yelp.shape

(1000, 2)

In [61]:
data_imdb.shape

(748, 2)

In [62]:
data_amazon.shape

(1000, 2)

### We now have three datasets
- Append them all into a single DataFrame

In [63]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
data = pd.concat([data_yelp, data_imdb, data_amazon], ignore_index=True)

In [64]:
data.shape

(2748, 2)

In [65]:
data.head()

| | Review | Sentiment |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


### Let's check how many positive and negative reviews are in the data
- 1 --> 1386 (+ve reviews)
- 0 --> 1362 (-ve reviews)

In [67]:
data['Sentiment'].value_counts()

1    1386
0    1362
Name: Sentiment, dtype: int64

In [69]:
# check for null values
data.isnull().sum()

# output: there are no null values in the data

Review       0
Sentiment    0
dtype: int64

### TOKENIZATION

In [75]:
import string

In [76]:
pun = string.punctuation

In [77]:
pun

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

- LEMMATIZE AND LOWERCASE EACH TOKEN
- REMOVE STOP WORDS AND PUNCTUATION, KEEPING THE REMAINING WORDS IN A LIST
- FINALLY RETURN THE RESULT

In [78]:
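def text_cleaning_data(sentence):
    """Lemmatize and lowercase a sentence, then drop stop words and punctuation."""
    doc = nlp(sentence)
    # LEMMATIZATION
    tokens = []
    for token in doc:
        # spaCy 2.x lemmatizes every pronoun to "-PRON-"; keep the lowercased text for those
        if token.lemma_ != "-PRON-":
            temp = token.lemma_.lower().strip()
        else:
            temp = token.lower_
        tokens.append(temp)
    # STOP WORDS AND PUNCTUATION
    cleaned_tokens = []
    for token in tokens:
        # `pun` is a string, so `in` is a substring test: it drops punctuation
        # marks and the empty strings left behind by whitespace tokens
        if token not in stopwords and token not in pun:
            cleaned_tokens.append(token)
    return cleaned_tokens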

In [81]:
text_cleaning_data("         Hello How are you guys, please like my presentation")

['hello', 'guy', 'like', 'presentation']

### VECTORIZATION (TF-IDF)

In [82]:
from sklearn.svm import LinearSVC

In [84]:
tfidf = TfidfVectorizer(tokenizer = text_cleaning_data)
classifier = LinearSVC()
# tokenization is done by our own function (text_cleaning_data)
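
To make the weighting concrete, here is a hand-rolled sketch of TF-IDF using the smoothed IDF and L2 normalisation that correspond to `smooth_idf=True` and `norm='l2'`. The documents are already tokenized and made up for the example; the `tfidf` helper is hypothetical and only mirrors the formula, not scikit-learn's sparse implementation.

```python
import math
from collections import Counter

# Toy, pre-tokenized documents (hypothetical)
docs = [["good", "food"], ["bad", "food"], ["good", "good", "movie"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for d in docs for term in set(d))

def tfidf(doc):
    """Return L2-normalised TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for d in docs:
    print({t: round(w, 3) for t, w in tfidf(d).items()})
```

Terms that occur in fewer documents (here `bad` and `movie`) receive a larger IDF than terms spread across many documents (`good`, `food`), so they weigh more within their document's vector.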

In [85]:
X = data['Review']
y = data['Sentiment']


In [86]:
X.shape

(2748,)

In [87]:
y.shape

(2748,)

In [88]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=10)

In [89]:
X_train.shape

(2198,)

In [90]:
X_test.shape

(550,)

In [91]:
y_train.shape

(2198,)

In [92]:
y_test.shape

(550,)
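
The split sizes above follow from simple arithmetic. The sketch below assumes the test share is rounded up to a whole number of samples and the remainder goes to training, which is consistent with the shapes shown.

```python
import math

n_samples = 2748   # total reviews after combining the three datasets
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest is training data
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```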

In [93]:
clf = Pipeline([('tfidf', tfidf), ('clf', classifier)])

In [94]:
clf.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [95]:
print("Training accuracy (%): ",clf.score(X_train,y_train)*100)

Training accuracy (%):  97.49772520473158


In [96]:
y_pred = clf.predict(X_test)

In [97]:
y_pred

array([1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1,
       0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1,

In [98]:
confusion_matrix(y_test, y_pred)

array([[212,  67],
       [ 61, 210]], dtype=int64)

In [99]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.76      0.77       279
           1       0.76      0.77      0.77       271

   micro avg       0.77      0.77      0.77       550
   macro avg       0.77      0.77      0.77       550
weighted avg       0.77      0.77      0.77       550
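
The figures in the report can be re-derived by hand from the confusion matrix. A small sketch, assuming the usual scikit-learn layout where rows are true labels and columns are predicted labels:

```python
# Confusion matrix from the notebook: rows = true label, columns = prediction
cm = [[212, 67],
      [61, 210]]

tn, fp = cm[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm[1]   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tn + fp + fn + tp)

# Class 1 (positive reviews)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Class 0 (negative reviews)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(round(accuracy, 2))
print(round(precision_pos, 2), round(recall_pos, 2), round(f1_pos, 2))
print(round(precision_neg, 2), round(recall_neg, 2))
```

Test accuracy comes out at about 77%, well below the 97.5% measured on the training data, which suggests the model overfits the training set to some degree.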



In [101]:
clf.predict(['Wow, this is bad'])

array([0], dtype=int64)

In [102]:
clf.predict(['Worth of watching it. Please like it'])

array([1], dtype=int64)

In [104]:
clf.predict(['Wow, this is amzing lesson i liked it'])

array([1], dtype=int64)

In [105]:
clf.predict(['i hate this movie'])

array([0], dtype=int64)

In [108]:
clf.predict(['i hate this kind of movies but this is an exemption but i like this a little bit'])

array([0], dtype=int64)

In [109]:
clf.predict(['fantastic movie'])

array([1], dtype=int64)