In [1]:
import sklearn
from sklearn.datasets import load_files

In [6]:
moviedir = 'dataset/movie_review'

In [7]:
movie_train = load_files(moviedir, shuffle=True)

In [8]:
len(movie_train.data)

2000

In [9]:
movie_train.target_names

['neg', 'pos']

In [15]:
# First 500 characters of the first review
print("content: ",movie_train.data[0][:500])
print("***********")
print("filename: ", movie_train.filenames[0])
print("***********")
print("polarity: ", movie_train.target[0])

content:  b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so cal"
***********
filename:  dataset/movie_review\neg\cv405_21868.txt
***********
polarity:  0


#### Text vs. numerical data conundrum

Most ML models work on numerical data, but our data set consists of text. How do we bridge that gap?

So far we have developed a working hypothesis for converting text into numerical data:

1. Use CountVectorizer (from sklearn.feature_extraction.text import CountVectorizer) to convert the text into token counts.

2. Use a technique that represents the corpus as vectors in a high-dimensional space. We have seen TF-IDF, word2vec and one-hot encoding.

https://github.com/saurav-joshi/statistical-inferences/blob/master/vectorization.ipynb

https://github.com/saurav-joshi/statistical-inferences/blob/master/tfidf.ipynb

http://localhost:8888/notebooks/wordvectors.ipynb
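The two steps above can be sketched in plain Python on a toy corpus. This is only an illustration: the three documents are made up, tokenization is a simple whitespace split (not what CountVectorizer or nltk.word_tokenize do), and the weighting follows the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, with L2 row normalization that scikit-learn documents as the TfidfTransformer defaults.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not part of the movie review data set.
docs = [
    "a great great movie",
    "a terrible movie",
    "great acting",
]

# Step 1: bag of words. One count vector per document over a shared vocabulary.
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
counts = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Step 2: TF-IDF. Terms that occur in fewer documents get a larger weight.
n_docs = len(docs)
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def l2_normalize(row):
    """Scale a vector to unit length; leave all-zero vectors unchanged."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else row


tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

print("vocabulary:", vocab)
for doc, row in zip(docs, tfidf):
    print(doc, "->", [round(x, 3) for x in row])
```

Each document ends up as one row with one column per vocabulary term, which is the same layout as the (documents, terms) matrix produced later in the notebook.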


In [22]:
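from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import nltk

# Count tokens per review; min_df=2 drops tokens that occur in fewer than two reviews.
# nltk.word_tokenize requires the NLTK tokenizer models to be downloaded beforehand.
movie_vec = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize)
movie_counts = movie_vec.fit_transform(movie_train.data)

# Re-weight the raw counts with TF-IDF.
tfidf_transformer = TfidfTransformer()
movie_tfidf = tfidf_transformer.fit_transform(movie_counts)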



In [24]:
# Print some results
print("Corpora shape, ", movie_tfidf.shape)
print("********")
print("vector representation of text")
# toarray() converts the sparse matrix to a dense one, which is memory-hungry at this size.
print(movie_tfidf.toarray())

Corpora shape,  (2000, 25313)
********
vector representation of text
[[ 0.          0.          0.03844713 ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 ..., 
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]]


##### The steps above give us a numerical representation of the data that can be used to build a classifier, but the job is only half done: we still need to train the classifier and check how well it performs.

A similar exercise was done in https://github.com/saurav-joshi/statistical-inferences/blob/master/PCA.ipynb

For the present problem we start with Naive Bayes. Naive Bayes is often regarded as the vanilla algorithm, and many supposedly better algorithms are available, so why Naive Bayes?

How do we benchmark algorithms?

https://github.com/saurav-joshi/statistical-inferences/blob/master/more_supervised_algorithms.ipynb
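One possible way to benchmark algorithms is to score each candidate with the same cross-validation splits and compare it against a trivial baseline. The sketch below is an assumption about how that could look, not the approach of the linked notebook: the toy corpus, the choice of candidates and the name results are all hypothetical. In this notebook the TF-IDF weights are fitted on all 2000 reviews before the split; placing the vectorizer inside a pipeline means it is refit on the training part of every fold instead.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; in the notebook this would be
# movie_train.data and movie_train.target.
texts = [
    "a wonderful and moving film",
    "great acting and a great script",
    "one of the best movies this year",
    "funny, warm and well paced",
    "a dull and boring film",
    "terrible acting and a weak script",
    "one of the worst movies this year",
    "slow, cold and badly paced",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = pos, 0 = neg

candidates = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in candidates.items():
    # The vectorizer is part of the pipeline, so each fold fits it on training data only.
    pipe = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipe, texts, labels, cv=4, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} over {len(scores)} folds")
```

A corpus this small says nothing about which model is better; the point is only the shape of the comparison. A model is worth keeping if it clearly beats the baseline on the real data.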

In [26]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
docs_train, docs_test, y_train, y_test = train_test_split(
    movie_tfidf, movie_train.target, test_size = 0.20, random_state = 12)


In [28]:
clf = MultinomialNB().fit(docs_train, y_train)

In [30]:
from sklearn.metrics import accuracy_score
y_pred = clf.predict(docs_test)
accuracy_score(y_test, y_pred)

0.82250000000000001

In [31]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[176,  30],
       [ 41, 153]], dtype=int64)
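The confusion matrix can be broken down into per-class numbers. A small sketch, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, with index 0 = neg and 1 = pos as in target_names. The four counts are copied from the output above; the variable names are illustrative only.

```python
# Counts copied from the confusion matrix above.
# Rows are true labels, columns are predicted labels (0 = neg, 1 = pos).
cm = [[176, 30],
      [41, 153]]

tn, fp = cm[0]  # true neg predicted neg, true neg predicted pos
fn, tp = cm[1]  # true pos predicted neg, true pos predicted pos
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)  # of the reviews predicted pos, how many really are pos
recall_pos = tp / (tp + fn)     # of the pos reviews, how many were found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy:  {accuracy:.4f}")
print(f"pos: precision {precision_pos:.3f}, recall {recall_pos:.3f}")
print(f"neg: precision {precision_neg:.3f}, recall {recall_neg:.3f}")
```

The accuracy computed this way matches the 0.8225 reported by accuracy_score. The split also shows that the model misses positive reviews (41) more often than it misses negative ones (30).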

#### We have built a model that can perform sentiment analysis. Let's try it on some new text.

### Classifying unseen data that the classifier would receive directly from the field

In [33]:
reviews_new = [
    'This movie was excellent', 'Absolute joy ride',
    'Steven Seagal was terrible', 'Steven Seagal shined through.',
    'This was certainly a movie', 'Two thumbs up', 'I fell asleep halfway through',
    "We can't wait for the sequel!!", '!', '?', 'I cannot recommend this highly enough',
    'instant classic.', 'Steven Seagal was amazing. His performance was Oscar-worthy.',
]
# Reuse the already fitted vectorizer and transformer: transform only, no refitting.
reviews_new_counts = movie_vec.transform(reviews_new)
reviews_new_tfidf = tfidf_transformer.transform(reviews_new_counts)

In [34]:
pred = clf.predict(reviews_new_tfidf)

In [35]:
for review, category in zip(reviews_new, pred):
    print('%r => %s' % (review, movie_train.target_names[category]))

'This movie was excellent' => pos
'Absolute joy ride' => pos
'Steven Seagal was terrible' => neg
'Steven Seagal shined through.' => neg
'This was certainly a movie' => neg
'Two thumbs up' => neg
'I fell asleep halfway through' => neg
"We can't wait for the sequel!!" => neg
'!' => neg
'?' => neg
'I cannot recommend this highly enough' => pos
'instant classic.' => pos
'Steven Seagal was amazing. His performance was Oscar-worthy.' => neg
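The hard labels above hide how confident the classifier is. One way to look behind them is predict_proba. The sketch below uses a made-up toy corpus and a hypothetical clf_toy, not the fitted model from this notebook. It also shows one general property of MultinomialNB: a text with no token in the fitted vocabulary becomes an all-zero vector, so the prediction falls back to the class priors. Whether that applies to any of the reviews above depends on the fitted vocabulary, which is not shown here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set with three neg and two pos examples.
train_texts = [
    "boring and terrible plot",
    "terrible acting",
    "a dull boring mess",
    "excellent and fun",
    "excellent acting great fun",
]
train_labels = [0, 0, 0, 1, 1]  # 0 = neg, 1 = pos

vec = TfidfVectorizer()
clf_toy = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)

# The second text contains no word the vectorizer has seen.
new_texts = ["excellent fun", "zzz qqq"]
proba = clf_toy.predict_proba(vec.transform(new_texts))

for text, p in zip(new_texts, proba):
    print(f"{text!r}: P(neg) = {p[0]:.3f}, P(pos) = {p[1]:.3f}")

# Class priors learned from the label frequencies (3/5 neg, 2/5 pos).
print("class priors:", np.exp(clf_toy.class_log_prior_))
```

Applied to the real model, the same call (clf.predict_proba(reviews_new_tfidf)) would show which of the short reviews were close calls.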
