## aim: predict whether a movie review is positive or negative using the text of the review

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

#CountVectorizer is a class that converts text into a term - document matrix
#it breaks each text down into individual terms and stores the number of occurrences of each term in that text as an element of the matrix
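
to make the idea concrete, here is a minimal sketch on a tiny made-up corpus (the three "reviews" and the names `docs`, `toy_bow`, `toy_counts` are assumptions for illustration, not part of the dataset used below). each row is a text, each column is a term, and each element is a count:

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus, made up for illustration (not from imdb_labelled.txt)
docs = ["good movie", "bad movie", "good good acting"]

toy_bow = CountVectorizer()
toy_counts = toy_bow.fit_transform(docs)   # sparse matrix of term counts

# vocabulary_ maps each term to its column index
print(sorted(toy_bow.vocabulary_.items(), key=lambda kv: kv[1]))
# one row per text, one column per term
print(toy_counts.toarray())
```

"good" appears twice in the third text, so that element is 2, and every term that does not appear in a text gets a 0.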

In [10]:
file = "imdb_labelled.txt"     #path to the data file
names = ['text', 'label']       #names for the two columns in the file
df = pd.read_csv(file, header = None, names = names, sep = '\t', quoting = 3)
df.head()

Unnamed: 0,text,label
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


In [19]:
bow = CountVectorizer()
x = bow.fit_transform(df.text)     #converting the text documents into a term - document matrix and storing it in x
x

<1000x3047 sparse matrix of type '<class 'numpy.int64'>'
	with 12666 stored elements in Compressed Sparse Row format>

In [20]:
print(x)

  (0, 1639)	1
  (0, 3037)	1
  (0, 786)	1
  (0, 748)	1
  (0, 37)	1
  (0, 1748)	1
  (0, 92)	1
  (0, 1750)	1
  (0, 2404)	1
  (0, 2871)	3
  (1, 1875)	1
  (1, 2905)	1
  (1, 2969)	1
  (1, 1837)	1
  (1, 1206)	1
  (1, 1777)	1
  (1, 196)	1
  (1, 1862)	1
  (1, 431)	1
  (1, 1035)	1
  (1, 2638)	2
  (1, 1605)	1
  (1, 1733)	1
  (1, 2917)	1
  (1, 2965)	1
  :	:
  (996, 1852)	1
  (996, 2658)	1
  (996, 1358)	1
  (996, 1605)	1
  (996, 2917)	1
  (997, 837)	1
  (997, 3001)	1
  (997, 1428)	1
  (997, 1423)	1
  (997, 1358)	1
  (998, 911)	1
  (998, 222)	1
  (999, 1393)	1
  (999, 1397)	1
  (999, 1316)	1
  (999, 1430)	1
  (999, 1725)	1
  (999, 2921)	1
  (999, 123)	1
  (999, 1854)	1
  (999, 100)	2
  (999, 1358)	1
  (999, 2694)	1
  (999, 125)	1
  (999, 1837)	1
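
each printed line is (row, column) followed by the count: the row is the text, the column is the term. a rough sketch of how the column numbers could be turned back into words, using a made-up two-review corpus (`docs`, `toy_bow`, `toy_x` and `index_to_term` are illustrative names, not from the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["very very slow movie", "great movie"]   # made-up reviews
toy_bow = CountVectorizer()
toy_x = toy_bow.fit_transform(docs)

# invert vocabulary_ (term -> column) to get column -> term
index_to_term = {col: term for term, col in toy_bow.vocabulary_.items()}

# walk the stored (row, column, count) entries, like print(x) does
coo = toy_x.tocoo()
entries = [(int(r), index_to_term[int(c)], int(v))
           for r, c, v in zip(coo.row, coo.col, coo.data)]
for row, term, count in entries:
    print(row, term, count)
```

the same inversion should work on `bow.vocabulary_` for the real matrix.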


### now we must find those words that are useful

for example - the words "it", "is", "the", "a" are not useful but the words "loved", "great", "hated", "awful" are.
we do this by assigning weights to each word that appears

the basic weight formula would be -- number of times word appears in text / proportion of texts word appears in

so if "loved" appears in the first text once and appears in 1% of all the texts, then its weight would be -- 1 / 1% = 100

and if "movie" appears in the first text once and appears in 33% of all the texts, then its weight would be -- 1 / 33% = 3
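
the arithmetic above as a small sketch (`basic_weight` is a hypothetical helper, and the third word is an invented misspelling that appears in 0.1% of texts):

```python
def basic_weight(count_in_text, proportion_of_texts):
    """number of times word appears in text / proportion of texts word appears in"""
    return count_in_text / proportion_of_texts

loved = basic_weight(1, 0.01)    # "loved": once in the text, in 1% of texts
movie = basic_weight(1, 0.33)    # "movie": once in the text, in 33% of texts
typo = basic_weight(1, 0.001)    # hypothetical misspelling, in 0.1% of texts
print(loved, movie, typo)
```

the misspelling ends up with ten times the weight of "loved", even though it tells us nothing about sentiment.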

but this formula can give rare / misspelled words a very high weight, which won't benefit us. so we need to downweight super frequent words without overweighting rare ones - which is what <b> term frequency - inverse document frequency </b> (tf-idf) does

the <b> tf-idf formula</b> is -- number of times word appears in text * log( 1 / proportion of texts word appears in)

the elements of the matrix we got from bow represent the term frequency (tf), aka the number of times a word appears in a text, so now we need to find the idf
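
a minimal sketch of the textbook weighting, where the count is multiplied by log(1 / proportion of texts the word appears in). `tf_idf` is a hypothetical helper and the natural log is an assumption (any base gives the same ranking):

```python
import math

def tf_idf(count_in_text, proportion_of_texts):
    """tf * idf, with idf = log(1 / proportion of texts containing the word)"""
    return count_in_text * math.log(1 / proportion_of_texts)

print(tf_idf(1, 0.01))    # "loved"
print(tf_idf(1, 0.33))    # "movie"
print(tf_idf(1, 0.001))   # hypothetical rare misspelling
print(tf_idf(1, 1.0))     # a word that appears in every text
```

the log squashes the gap: the rare misspelling now gets only 1.5 times the weight of "loved" instead of ten times, and a word that appears in every text gets a weight of zero.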

scikit-learn is super kind and has a class much like CountVectorizer that finds the tf-idf matrix for us, called TfidfVectorizer

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
x = tfidf.fit_transform(df.text)
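
note that with its default settings TfidfVectorizer does not use the plain log(1 / proportion) idf: it uses a smoothed idf, ln((1 + n) / (1 + df)) + 1, and then scales each row to unit length (L2 norm). a sketch that rebuilds this by hand on a made-up corpus and compares it with scikit-learn (`docs`, `manual` and `toy_tfidf` are illustrative names):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["good movie", "bad movie", "good good acting"]   # made-up corpus

counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)      # number of texts containing each term

# smoothed idf, as used by TfidfVectorizer's defaults (smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
weighted = counts * idf
# scale every row to unit L2 norm (norm='l2' default)
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

toy_tfidf = TfidfVectorizer()
from_sklearn = toy_tfidf.fit_transform(docs).toarray()
print(np.allclose(manual, from_sklearn))
```

this is why the values printed for the real matrix are all between 0 and 1.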

In [17]:
print(x)

  (0, 2871)	0.538231616282
  (0, 2404)	0.279231657722
  (0, 1750)	0.294988181032
  (0, 92)	0.337896791919
  (0, 1748)	0.129034479447
  (0, 37)	0.191065991859
  (0, 748)	0.337896791919
  (0, 786)	0.337896791919
  (0, 3037)	0.294988181032
  (0, 1639)	0.250242919112
  (1, 1813)	0.16837950966
  (1, 2567)	0.298871784424
  (1, 2965)	0.195713228608
  (1, 2917)	0.128708711135
  (1, 1733)	0.208098455651
  (1, 1605)	0.288646914192
  (1, 2638)	0.153084697784
  (1, 1035)	0.312053934844
  (1, 431)	0.198190693531
  (1, 1862)	0.192231407001
  (1, 196)	0.28029258691
  (1, 1777)	0.330633132357
  (1, 1206)	0.273229103839
  (1, 1837)	0.0998301966822
  (1, 2969)	0.312053934844
  :	:
  (996, 3003)	0.269069466257
  (996, 2288)	0.261415181915
  (996, 2812)	0.344992074668
  (996, 2886)	0.395174189216
  (996, 2143)	0.395174189216
  (997, 1358)	0.271928281324
  (997, 1423)	0.224838839198
  (997, 1428)	0.232666392035
  (997, 3001)	0.619969808539
  (997, 837)	0.661064514796
  (998, 222)	0.46735005602
  (998, 911)

this is a massive matrix, since there are so many words, and there are two problems when working with massive matrices:
- they take up too much computer memory
- they're harder to train models on to predict outcomes from new, unseen data

so, we can use a sparse matrix, which stores only the non-zero elements. CountVectorizer() and TfidfVectorizer() return sparse matrices
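
a rough sketch of the memory saving, using a matrix with the same shape and number of stored elements as the CountVectorizer output printed earlier (1000 x 3047 with 12666 non-zeros). the positions of the non-zeros are random here, not the real review counts:

```python
import numpy as np
from scipy import sparse

# dense matrix of zeros with 12666 randomly placed ones (positions are made up)
rng = np.random.default_rng(0)
dense = np.zeros((1000, 3047), dtype=np.int64)
flat_positions = rng.choice(dense.size, size=12666, replace=False)
dense.flat[flat_positions] = 1

# compressed sparse row format keeps only the non-zero values and their positions
csr = sparse.csr_matrix(dense)
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(dense.nbytes, sparse_bytes)
```

fewer than 1% of the elements are non-zero, so the sparse version needs only a small fraction of the memory.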

to deal with problem number two, we can use logistic regression with regularization, a technique that shrinks the weights of columns that are not necessary for classification (an L1 penalty can set them to exactly zero) <b> OR </b> we can compress the term - document matrix
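
a minimal sketch of the first option, assuming an L1 penalty (the notes don't say which penalty is meant, and the L1 one is the kind that can zero out columns). the six texts, their labels and the value of C are made up for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# tiny made-up dataset, only for illustration
texts = ["loved it great movie", "great acting loved the story",
         "awful movie hated it", "hated the awful acting",
         "great story", "awful story"]
labels = [1, 1, 0, 0, 1, 0]

# the L1 penalty can push the weights of unhelpful columns to exactly zero;
# C sets the strength (smaller C = stronger regularization)
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(penalty="l1", solver="liblinear", C=10.0),
)
model.fit(texts, labels)

coefs = model.named_steps["logisticregression"].coef_.ravel()
n_columns = coefs.size
n_kept = int(np.count_nonzero(coefs))
print(n_kept, "of", n_columns, "columns have a non-zero weight")
```

columns whose weight is zero have no effect on the prediction, so they are effectively filtered out.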

compressing is like clustering - each word is assigned a score based on how closely it is associated with a cluster. for example - "bad", "awful", "terrible" all have similar meanings, and so by this method they'll be compressed into one column

^^ known as reducing the dimensionality of a term - document matrix 
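
the notes don't name a specific compression method. one common choice is truncated SVD (aka latent semantic analysis), so the sketch below assumes that. the six texts and the choice of 2 components are made up for illustration:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up texts: three negative-sounding, three positive-sounding
texts = ["bad awful movie", "terrible awful acting", "bad terrible story",
         "great wonderful movie", "wonderful lovely acting", "great lovely story"]

toy_x = TfidfVectorizer().fit_transform(texts)        # one column per term
svd = TruncatedSVD(n_components=2, random_state=0)   # compress to 2 columns
compressed = svd.fit_transform(toy_x)

print(toy_x.shape, "->", compressed.shape)
```

each text is now described by 2 numbers instead of one number per term. TruncatedSVD accepts the sparse matrix directly, so the tf-idf output never has to be converted to a dense array first.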

In [23]:
import numpy as np     # needed for np.mean below
from muffnn import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Chain together tf-idf and an MLP with a single hidden layer of size 256
mlp = MLPClassifier(hidden_units=(256,))
classifier = make_pipeline(tfidf, mlp)

# Get cross-validated accuracy of the model
cv_accuracy = cross_val_score(classifier, df.text, df.label, cv=5)
print("Mean Accuracy: {}".format(np.mean(cv_accuracy)))

Mean Accuracy: 0.7809999999999999
Pipeline(memory=None,
     steps=[('tfidfvectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_i...rflow.python.training.adam.AdamOptimizer'>,
       solver_kwargs=None, transform_layer_index=None))])
