# IMPORTING THE LIBRARIES

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix
from sklearn import metrics

# READING THE DATA FROM A TSV FILE

In [2]:
data = pd.read_csv(r"C:/Users/G.VENKATARAMANA/NLP/UPDATED_NLP_COURSE/TextFiles/moviereviews.tsv",sep = '\t')

# GLIMPSE OF THE DATA

In [3]:
data.head()

  label                                             review
0   neg  how do films like mouse hunt get into theatres...
1   neg  some talented actresses are blessed with a dem...
2   pos  this has been an extraordinary year for austra...
3   pos  according to hollywood movies made in last few...
4   neg  my first press screening of 1998 and already i...


In [4]:
data.shape

(2000, 2)

# CHECKING FOR NULL VALUES 

In [5]:
data.isnull().any()

label     False
review     True
dtype: bool

# DROPPING THE ROWS WITH NULL VALUES

In [6]:
data.dropna(inplace = True)

In [7]:
data.isnull().any()

label     False
review    False
dtype: bool

# CHECKING FOR EMPTY SPACES IN THE REVIEW COLUMN

In text datasets, missing reviews are sometimes stored as whitespace-only strings rather than as nulls, so isnull() does not catch them.

Such rows carry no information and can push the model toward wrong predictions.

So, we remove them from the dataframe.

We test each review with the .isspace() method and append the index of every whitespace-only row to the blanks list.

In [8]:
blanks = []
for i,lab,rev in data.itertuples():
    if rev.isspace():
        blanks.append(i)

In [9]:
blanks

[57,
 71,
 147,
 151,
 283,
 307,
 313,
 323,
 343,
 351,
 427,
 501,
 633,
 675,
 815,
 851,
 977,
 1079,
 1299,
 1455,
 1493,
 1525,
 1531,
 1763,
 1851,
 1905,
 1993]

In [10]:
data['review'][343]

'  '
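The loop above can also be written as a vectorized check. The following is a minimal sketch on a made-up toy frame (the name `toy` and its rows are assumptions for illustration, not the movie review data). It treats missing values and whitespace-only strings as blank in one pass.

```python
import pandas as pd

# Toy frame standing in for the reviews; the rows are made up for illustration.
toy = pd.DataFrame({
    "label": ["neg", "pos", "neg", "pos"],
    "review": ["dull and far too long", "   ", None, "a real delight"],
})

# A review is unusable when it is missing or contains nothing but whitespace.
# fillna("") lets the string methods treat missing values like empty text.
is_blank = toy["review"].fillna("").str.strip().eq("")

blank_index = toy.index[is_blank].tolist()   # index labels of the blank rows
cleaned = toy[~is_blank]                     # rows that carry real text

print(blank_index)
print(cleaned.shape)
```

Unlike .isspace(), which returns False for an empty string, stripping and comparing with "" also catches reviews that are completely empty.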

# DROPPING THE ROWS WITH EMPTY VALUES

Passing the list of index labels to drop() removes the corresponding rows.

inplace = True modifies the dataframe itself instead of returning a new copy.

In [11]:
data.drop(blanks,inplace = True)

In [12]:
data.shape

(1938, 2)

# DIVIDING THE VALUES INTO X AND Y

In [13]:
x = data['review']
y = data['label']

x.shape,y.shape

((1938,), (1938,))

# SPLITTING THE VALUES INTO TRAINING DATA AND TESTING DATA

In [14]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3,random_state = 117)

In [15]:
x_train.shape,x_test.shape,y_train.shape,y_test.shape
x_train.dtype

dtype('O')
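The cell above displays only the dtype, because a notebook shows just the last expression. As a rough check on the split sizes, the sketch below runs the same test_size and random_state on stand-in data of the same length. The `rows` and `labels` values are made up, and the `stratify` argument is an optional addition that the notebook does not use; it keeps the class proportions similar in both parts.

```python
from sklearn.model_selection import train_test_split

# Stand-in data: 1938 row positions with alternating made-up labels.
rows = list(range(1938))
labels = ["neg" if i % 2 == 0 else "pos" for i in rows]

# Same test_size and random_state as the notebook; stratify is an optional extra.
x_tr, x_te, y_tr, y_te = train_test_split(
    rows, labels, test_size=0.3, random_state=117, stratify=labels
)

# The test part is rounded up: ceil(0.3 * 1938) = 582, leaving 1356 for training.
print(len(x_tr), len(x_te))
```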

# CREATING A PIPELINE OBJECT

The pipeline object saves us from applying the vectorizer and the model separately to the training data and the test data.

Calling fit on the pipeline object performs both tasks.

The text is vectorized and the model is fitted on the resulting vectors.

The same holds for prediction. Without a pipeline, we would have to vectorize the test data ourselves before predicting.

The pipeline object keeps these steps together behind a single call.

In [16]:
clf_mod = Pipeline([('tfidf', TfidfVectorizer()), ('mod', LinearSVC())])
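To make the saving concrete, here is a small sketch that compares the manual route with the pipeline route. The toy reviews, the variable names, and the fixed `random_state=0` are assumptions added for illustration; the notebook itself uses the default LinearSVC settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy reviews, only to keep the sketch self-contained.
train_text = ["great film, loved it", "terrible plot and bad acting",
              "wonderful and moving", "boring, bad, a waste of time"]
train_label = ["pos", "neg", "pos", "neg"]
test_text = ["loved the wonderful acting", "bad and boring"]

# Manual route: fit the vectorizer on the training text only,
# then reuse the fitted vocabulary to transform the test text.
vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_text)
test_vectors = vectorizer.transform(test_text)
manual_model = LinearSVC(random_state=0).fit(train_vectors, train_label)
manual_pred = manual_model.predict(test_vectors)

# Pipeline route: the same two steps behind a single fit and a single predict.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("mod", LinearSVC(random_state=0))])
pipe.fit(train_text, train_label)
pipe_pred = pipe.predict(test_text)

print(list(manual_pred) == list(pipe_pred))
```

Both routes apply the same transformations in the same order, so the pipeline is mainly a convenience and a guard against forgetting to vectorize new text.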

# FITTING THE MODEL

In [17]:
clf_mod.fit(x_train,y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('mod',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
               

# PREDICTING ON THE TEST DATA

In [39]:
pred = clf_mod.predict(x_test)

# PRINTING THE CONFUSION MATRIX

In [40]:
print(confusion_matrix(y_test,pred))

[[243  57]
 [ 33 249]]
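In scikit-learn's confusion matrix, rows are the true labels and columns are the predicted labels, in sorted label order (here neg, then pos). The sketch below copies the counts printed above and derives the totals by hand; the variable names are chosen only for illustration.

```python
# Counts copied from the matrix printed above.
# Rows are true labels, columns are predicted labels, in the order neg, pos.
matrix = [[243, 57],
          [33, 249]]

neg_as_neg, neg_as_pos = matrix[0]   # true neg reviews: correct, misclassified
pos_as_neg, pos_as_pos = matrix[1]   # true pos reviews: misclassified, correct

total = sum(sum(row) for row in matrix)
correct = neg_as_neg + pos_as_pos    # the diagonal holds the correct predictions
errors = neg_as_pos + pos_as_neg     # the off-diagonal cells hold the mistakes

accuracy = correct / total
print(total, correct, errors)
print(accuracy)
```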


# CHECKING FOR THE ACCURACY OF THE MODEL

In [41]:
print(f'Accuracy of the model is :{metrics.accuracy_score(y_test,pred)}')

Accuracy of the model is :0.845360824742268


In [42]:
print(metrics.classification_report(y_test,pred))

              precision    recall  f1-score   support

         neg       0.88      0.81      0.84       300
         pos       0.81      0.88      0.85       282

    accuracy                           0.85       582
   macro avg       0.85      0.85      0.85       582
weighted avg       0.85      0.85      0.85       582



# CREATING A DATAFRAME OBJECT WITH SAMPLE OF REVIEWS FROM THE WEB

I took some movie reviews from the web to test the model on real-world data.

The model's predictions for them are as follows.

In [43]:
col = ['review']
rev = [
    
    #One 
    ["""Like many writers, I tend to subconsciously disown anything I’ve written more than a few months ago, so I read this question, in practice, as what’s my favorite thing I’ve written recently. On that front, I’d say that the review of “Phantom Thread” that I wrote over at my blog comes the closest to what I most desire to do as a critic. I try to think about a movie from every front: how the experience is the result of each aspect, in unique quantities and qualities, working together. It’s not just that the acting is compelling or the score is enveloping, it’s that each aspect is so tightly wound that it’s almost indistinguishable from within itself. A movie is not an algebra problem. You can’t just plug in a single value and have everything fall into place.

“Phantom Thread” is Paul Thomas Anderson’s dreamy cinematography. It is Jonny Greenwood’s impeccably seductive, baroque music. It is Vicky Krieps’s ability to perfectly shatter our preconceptions at every single turn as we realize that Alma is the movie’s actual main character. We often talk about how good films would be worse-off if some part of it were in any way different. In the case of “Phantom Thread,” you flat-out can’t imagine how it would even exist if these things were changed. When so many hot take thinkpieces try to explain away every ending or take a hammer to delicate illusions, it was a pleasure to try and understand how a movie like this one operates on all fronts to maintain an ongoing sense of mystique."""],
       
       
       
       
       #Two
       ["""I don’t know if it’s my best work, but a landmark in my life as a critic was surely a review of Chaplin’s “The Circus,” in time for the release of its restoration in 2010. I cherish this piece, written for Slant Magazine, for a number of reasons. For one, I felt deeply honored to shed more light on probably the least known and least respected of Chaplin’s major features, because it’s a film that demonstrates such technical virtuosity it dispels once and for all any notion that his work is uncinematic. (Yes, but what about the rest of his filmography you ask? My response is that any quibbles about the immobility of Chaplin’s camera suggest an ardent belief that the best directing equals the most directing.) For another, I was happy this review appeared in Slant Magazine, a publication that helped me cut my critical teeth and has done the same for a number of other critics who’ve gone on to write or edit elsewhere. That Slant is now struggling to endure in this financially ferocious landscape for criticism is a shame – the reviews I wrote for them around 2009-10 helped me refine my voice even that much more than my concurrent experience at Entertainment Weekly, where I had my day job. And finally, this particular review will always mean a lot to me because it’s the first one I wrote that I saw posted in its entirety on the bulletin board at Film Forum. For me, there was no surer sign that “I’d made it”.

"""],
    
    #Three
["""No way would I dare to recommend any pieces of my own, but I don’t mind mentioning a part of my work that I do with special enthusiasm. Criticism, I think, is more than the three A’s (advocacy, analysis, assessment); it’s prophetic, seeing the future of the art from the movies that are on hand. Yet many of the most forward-looking, possibility-expanding new films are in danger of passing unnoticed (or even being largely dismissed) due to their departure from familiar modes or norms, and it’s one of my gravest (though also most joyful) responsibilities to pay attention to movies that may be generally overlooked despite (or because of) their exceptional qualities. (For that matter, I live in fear of missing a movie that needs such attention.)

But another aspect of that same enthusiasm is the discovery of the unrealized future of the past—of great movies made and seen (or hardly seen) in recent decades that weren’t properly discussed and justly acclaimed in their time.”. Since one of the critical weapons used against the best of the new is an ossified and nostalgic classicism, the reëvaluation of what’s canonical, the acknowledgment of unheralded masterworks—and of filmmakers whose careers have been cavalierly truncated by industry indifference—is indispensable to and inseparable from the thrilling recognition of the authentically new."""],
       
       
       #Four
       ["""’ll always love this classic, not so much for its content, but for its position in film history:

http://hollywoodandfine.com/the-...

Though it may be difficult to verify, I believe this was the first negative review for The Dark Knight Rises to be published following the film’s release. What followed was a tirade of angry fans who, despite not having seen the film yet, sent the critics death threats for their opinions of the movie. The Dark Knight Rises incident remains entrenched in my mind as an amateur movie reviewer— an example of just how willfully ignorant, narrow-minded, shallow, undemanding, and uncritical audiences today are. The review, and reviews like it, contributed greatly to my slow realization that superhero movies are absolute garbage.

In the category of reviews that received unwarranted blowback from stupid people, I’d also include this one: 'Ghostbusters' reboot a horrifying mess

And then there’s always the wonderful RedLetterMedia, the only YouTube critics worth watching. Their review of Man of Steel brings warmth to my heart.

"""],
       
        #Five
       
       ["worst movie of all time in tollywood,did'nt expect such performance from the star. Even the budget was high, no good content from the film."],
    
    #Six
    
    ["Very good film"]]
data2 = pd.DataFrame(rev,columns = col)
data2

                                              review
0  Like many writers, I tend to subconsciously di...
1  I don’t know if it’s my best work, but a landm...
2  No way would I dare to recommend any pieces of...
3  ’ll always love this classic, not so much for ...
4  worst movie of all time in tollywood,did'nt ex...
5                                     Very good film


# PREDICTING THEM WITH THE TRAINED MODEL

In [44]:
print(clf_mod.predict(data2['review']))

['neg' 'pos' 'pos' 'pos' 'neg' 'pos']
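The labels alone do not show how sure the model is. As a hedged sketch, a pipeline ending in LinearSVC also exposes decision_function, which returns a signed distance from the separating boundary. The toy training data and the name `toy_model` below are assumptions for illustration and are not the model trained above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy training data, only to keep the sketch self-contained.
train_text = ["great film, loved it", "terrible plot and bad acting",
              "wonderful and moving", "boring, bad, a waste of time"]
train_label = ["pos", "neg", "pos", "neg"]

toy_model = Pipeline([("tfidf", TfidfVectorizer()), ("mod", LinearSVC(random_state=0))])
toy_model.fit(train_text, train_label)

new_reviews = ["wonderful film, loved it", "bad acting and a boring plot"]
labels = toy_model.predict(new_reviews)

# Signed distance from the decision boundary: positive values favour the
# second class in classes_, negative values favour the first.
scores = toy_model.decision_function(new_reviews)

for text, label, score in zip(new_reviews, labels, scores):
    print(f"{label:>3}  {score:+.3f}  {text}")
```

Scores close to zero mark reviews that the model places near the boundary, which can help when judging predictions on text that differs from the training reviews.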
