## Introduction
This tutorial introduces Pattern, which is a web mining module for the Python programming language.

It has independent modules for:

1. Data mining: Google, Twitter, Bing, Yahoo, Facebook and Wikipedia APIs that support varying degrees of search functionality, including image search on some sources; a web crawler; and an HTML DOM parser.
2. Natural language processing: part-of-speech taggers, n-gram search, sentiment analysis and WordNet.
3. Machine learning: vector space model, clustering, SVM, and network analysis and visualization.

The web module's APIs let Pattern extract data from a variety of sources, and the data processing and visualization tools in its language processing and machine learning modules make it a useful library for data science. Many of these features are also available in libraries introduced in class, such as BeautifulSoup for web scraping, scikit-learn for machine learning and NLTK for NLP. Pattern, however, offers them in a single package and adds features not found in those libraries: a web crawler in the web module, integrated APIs for mining popular networking sites, and a genetic algorithm implementation and Latent Semantic Analysis (LSA) in the vector module.

### Tutorial content

In this tutorial, we will show how to use the [Pattern](http://www.clips.ua.ac.be/pattern) module for data mining and processing, specifically using its [web module](http://www.clips.ua.ac.be/pages/pattern-web/), its [NLP module](http://www.clips.ua.ac.be/pages/pattern-en), and its [Machine Learning module](http://www.clips.ua.ac.be/pages/pattern-vector).


We will cover the following topics in this tutorial:
- [Installing the library and acquiring licenses](#Installing-the-libraries-and-acquiring-licenses)
- [Data Mining using the APIs](#Data-mining-using-the-APIs)
- [Using the NLP toolkit](#NLP-module)
- [Using the Vector module](#Vector-module)


## Installing the libraries and acquiring licenses

Before getting started, you'll need to install the Pattern library, which is written for Python 2.5+ (no support for Python 3 yet). It has no external dependencies, except LSA in the pattern.vector module, which requires NumPy. 
You can install Pattern using `pip`:

    $ pip install pattern
    
In case the pip installation causes problems, a slightly older version is also available through Conda, which can be installed with:

    $ conda install pattern

If neither of these methods works, the source can be downloaded from the [web site](http://www.clips.ua.ac.be/pattern). 
To install Pattern so that the module is available in all Python scripts, run the following from the command line:

    $ cd pattern-2.6
    $ python setup.py install

After you run the installs, make sure the following commands work for you:

In [65]:
from pattern.web import URL, SEARCH
from pattern.en import parse, parsetree
from pattern.vector import Document, Model
from pattern.graph import Graph
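
If one of these imports fails, it can help to find out which submodule is the problem. The following is a minimal sketch using only the standard library (it assumes Python 2.7 or later for `importlib`); `missing_modules` is a hypothetical helper written for this tutorial, not part of Pattern.

```python
import importlib

def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except Exception:
            # Any failure during import counts as "missing" for this check.
            missing.append(name)
    return missing

# Hypothetical helper, not part of Pattern.
# An empty list means every module imported cleanly.
print(missing_modules(['pattern.web', 'pattern.en', 'pattern.vector', 'pattern.graph']))
```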

To use the search APIs, you will need to acquire licenses from their respective sites, as listed here: 
[Google](https://code.google.com/apis/console/), [Facebook](http://www.clips.ua.ac.be/pattern-facebook), [Bing](https://datamarket.azure.com/dataset/5BA839F1-12CE-4CCE-BF57-A49D98D29A44), [Twitter](https://apps.twitter.com/app/new) 

## Data mining using the APIs

Having acquired the licenses for the various APIs, we can use them depending on our data needs. In general, Pattern's SearchEngine object has a number of subclasses that can be used to query different web services (e.g., Google, Wikipedia). SearchEngine.search() returns a list of Result objects for a given query string, just like a search field and a results page in a browser.
Each search engine has different settings for the search() method, and the different APIs provide varying features. In this section, we will look at some of the tasks that we can carry out using the Google, Facebook and Twitter APIs.

### Facebook
Facebook offers the option to search for publicly available posts that match our query using the SEARCH type. Once a user has authorized Pattern to access their profile, the NEWS type can be used to retrieve posts from a specific user. The COMMENTS type gives access to the comments associated with a particular post, identified by its post ID (possibly obtained from a NEWS or SEARCH query).

In [64]:
# Print the list of posts that I have made, along with the likes and comments associated with each post
from pattern.web import Facebook, SEARCH, NEWS, COMMENTS, LIKES
fb = Facebook(license=None) #Enter license here
me = fb.profile(id=None)
print "Profile name:",me[1]
posts=fb.search(me[0], type=NEWS, count=100)
for post in posts:
    print repr(post.id)
    print repr(post.text)
    print repr(post.url)
    if post.comments > 0:
        print '%i comments' % post.comments 
        print [(r.text, r.author) for r in fb.search(post.id, type=COMMENTS)]
    if post.likes > 0:
        print '%i likes' % post.likes 
        print [r.author for r in fb.search(post.id, type=LIKES)]
 

Profile name: Ritwik Rajendra


### Twitter
Twitter provides a couple of methods for keeping track of newly available data and of currently popular topics.

Twitter.stream() returns an endless, live stream of Result objects. A Stream is a Python list that accumulates new results each time Stream.update() is called. The trends() function returns a list of the most popular topics on Twitter at that time.
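
To make the "list that accumulates on update()" idea concrete, here is a small stand-alone sketch. `ToyStream` is a hypothetical class written for illustration; it reads from an in-memory iterator instead of a live connection and is not how Pattern implements its Stream.

```python
class ToyStream(list):
    """A list that grows by pulling new items from a source each time update() is called."""

    def __init__(self, source):
        list.__init__(self)
        self._source = iter(source)  # stands in for the live connection

    def update(self, batch=1):
        """Append up to `batch` new items from the source and return them."""
        new = []
        for _ in range(batch):
            try:
                new.append(next(self._source))
            except StopIteration:
                break  # nothing new has arrived
        self.extend(new)
        return new

s = ToyStream(['tweet 1', 'tweet 2', 'tweet 3'])
s.update()
s.update()
latest = s[-1] if s else ''  # guard against an empty stream before indexing
```

Because the stream is an ordinary list, the newest item is always at index -1, and an empty stream is falsy, which is why a guard such as `s[-1] if s else ''` is enough before reading it.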

In [70]:
import time
from pattern.web import Twitter
print "Top trending Twitter topics as of "+time.strftime("%d/%m/%Y %H:%M:%S")+":"
trends= Twitter().trends(cached=False, count=10)
for trend in trends:
    print trend

s = Twitter().stream(trends[0])
print "\nTweets related to the most trending topic, ",trends[0]," are"
for i in range(10):
    time.sleep(1) 
    s.update()
    print s[-1].text if s else ''

Top trending Twitter topics as of 19/10/2016 21:05:58:
#debatenight
World Series
Messi
#افلاس_السعوديه_بعد_3سنوات
Enem
Juventude
#MTVCatfishBR
#LaVoz5
#RejectedHillarySlogans
Jara
Pedro Rocha
Arrascaeta
VAI CORINTHIANS
Charlie Brown
São Victor
Derrick Rose
Borja
I'M DIRECTIONER
Urias
Thiago Martins
Polic
#TylerEndedBellaParty
#من_ملوثات_تويتر
#EuFicariaFelizSe
#Desaforados
#RRYMUDRodaronFeo
#GBBO
#sddsInvernoDetremuraSdv
#VivasLasQueremos
#SimoneeSimariaNoTVZ
#برشلونه_السيتي
#TodoCambia
#SignsYoureLying
#DinahAppreciationDay
#BeRomanticIn4Words
#POGOENMTVHITS
#TeQuedasOTevas
#MiercolesIntratable
#PoyrazKarayel
#لو_تدري_انك
#このタグ見た人は今欲しいものを言う
#WWENXT
#عايش_ليه_في_مصر
#SaiaJusta
#PorLaNochePrefiero
#NaucalpanSinLey
#Beşiktaş
#Top100DJs
#FicaRavena
#Velvet3

Tweets related to the most trending topic,  #debatenight  are


I feel like blood pressure medications must be sponsoring this election #debatenight #PresidentialDebate
I feel like blood pressure medications must be sponsoring this el

In [57]:
# Example of a query-based search on Twitter
t = Twitter()
for tweet in t.search('#arsenal', count=10):
    print tweet.text

so, every time i go out for a ride during an #Arsenal match and miss it, they boss it. I hear you, universe. I hear you. #coyg
RT @Stuart_PhotoAFC: Dennis Bergkamp and @MesutOzil1088 #arsenal https://t.co/NxKcB66lYs
RT @yara_lb: Selfie with The Legend of #Arsenal and #Barcelona @thierryhenry
#TierryHenry
#يارا #يارا_فانز #تيري_هنري #اسبانيا #برشلونة htt…
RT @ArsenalsRelated: Alexis Sánchez golazo vs. Ludogorets Razgrad. 🔥🔥🔥 #Arsenal https://t.co/D07AHsWwG4
⚽🏃 #Arsenal #foreverArsenal
RT @soccerdotcom: Mesut Ozil turning into a ruthless goal scorer now 👀👀 #Arsenal #UCL
Hoping that #Hillary delivers a performance as dominant as #Arsenal did today #ImWithHer #debate
#Arsenal Arsene Wenger: "Our confidence is stronger with every win but we have to keep the vigilance and  bring that into the next game."
#Ludogorets; I've had a few. #Arsenal
RT @ArsenalsRelated: El Jefe and der Chef.

#Arsenal https://t.co/Cg1qo3Krxt


### Google:
Apart from the generic search functionality inherited from the SearchEngine class, the Google API also gives access to the Google Translate API:
- Google.translate() returns the given string translated into the given language.
- Google.identify() returns a (language code, confidence)-tuple for a given string.

Please note that the Translate API is a paid service, and it needs to be enabled for the license being used with Pattern.

In [72]:
from pattern.web import Google, plaintext
engine = Google(license=None) # provide your license here, necessary to access the Translate API
for result in engine.search('CMU', cached=False):
    print result.url, plaintext(result.text)

http://www.cmu.edu/ CMU is a global research university known for its world-class, interdisciplinary
programs: arts, business, computing, engineering, humanities, policy and
science.
https://www.cmich.edu/ Students are offered educational experiences in the arts, humanities, and natural
and social sciences, in addition to educational depth in at least one academic ...
https://en.wikipedia.org/wiki/Carnegie_Mellon_University Coordinates: 40°26′36″N 79°56′37″W﻿ / ﻿40.443322°N 79.943583°W﻿ /
40.443322; -79.943583. Carnegie Mellon University is a private research
university ...
https://www.scs.cmu.edu/ Education in computer music, data mining, machine learning, vision, and speech
with a list of research topics.
https://www.cmu.edu/silicon-valley/ About Carnegie Mellon University Silicon Valley. CMU Silicon Valley. At its
Silicon Valley location, CMU's College of Engineering can integrate the rich
heritage ...
http://www.coloradomesa.edu/ A four-year state-supported institution in Grand Jun

In [None]:
s = "Qui craint de souffrir, il souffre déjà de ce qu’il craint."
g = Google(license=None) # provide a license for which the Translate API is enabled
lang, conf = g.identify(s)
print "Language is ", lang
print g.translate(s, input=lang, output='en', cached=False)

### RSS news feeds:
It is also possible to run searches over any RSS news feed pages using the Newsfeed object.

In [71]:
#Here, I am printing the list of topics available in the Panopto RSS feed for the 15-688 course
from pattern.web import Newsfeed
PDS = 'http://scs.hosted.panopto.com/Panopto/Podcast/Podcast.ashx?courseid=e33e67d2-5935-4103-b6b2-87c4097f8da4&type=mp4'
for result in Newsfeed().search(PDS):
    print repr(result.title)
    print repr(result.guid)

u'Lecture 01: Introduction'
u''
u'Lecture 2: Data Collection and Scraping'
u''
u'Lecture 03: Jupiter Notebook Lab'
u''
u'Lecture 4: Relational Data'
u''
u'Lecture 5: Visualization and Data Exploration'
u''
u'Lecture 6: Matrices, Vectors, and Linear Algebra'
u''
u'Lecture 7: Graph and Network Processing'
u''
u'Lecture 8: Free text and natural language processing'
u''
u'Lecture 9: Free text, continued'
u''
u'Lecture 10: Linear Regression'
u''


### Web Crawler:
Pattern also provides a Crawler object that takes a list of URLs and visits them. If a URL leads to a web page, the HTML content is parsed for new links, which are added to the list of links scheduled for a visit.
During initialization, it accepts links, domains and delay among its parameters. links is the initial list of URLs to visit, domains optionally restricts the crawler to the listed domains, and delay defines the number of seconds to wait before revisiting the same domain.

The example below defines a crawler that prints every link that it visits. In the same manner, the crawler can be set up to carry out any required task by overriding the visit() method.

In [75]:
from pattern.web import Crawler, DEPTH
class Printer(Crawler): 
    def visit(self, link, source=None):
        print 'visited:', link.url, 'from:', link.referrer
    def fail(self, link):
        print 'failed:', link.url

p = Printer(links=['http://datasciencecourse.org/'], delay=3)
for i in range(0,10):
    p.crawl(method=DEPTH, cached=False, throttle=3)

visited: http://datasciencecourse.org/ from: 
visited: http://datasciencecourse.org/#assignments from: http://datasciencecourse.org/
visited: http://datasciencecourse.org/#faq from: http://datasciencecourse.org/
visited: http://datasciencecourse.org/#instructors from: http://datasciencecourse.org/
visited: http://datasciencecourse.org/#overview from: http://datasciencecourse.org/
visited: http://datasciencecourse.org/#page-top from: http://datasciencecourse.org/
visited: http://datasciencecourse.org/#schedule from: http://datasciencecourse.org/
failed: http://datasciencecourse.org/NetworkXBasics.ipynb
failed: http://www.datasciencecourse.org/hw/1/handout.tar
visited: https://scs.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=0071471a-c880-4390-aa0d-3709f3d37776 from: http://datasciencecourse.org/
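
The visit-parse-schedule loop described in this section can be sketched without any network access. The example below is an illustration under stated assumptions: `PAGES` is a made-up in-memory "web" and `crawl` is a hypothetical function, not Pattern's Crawler. It only shows how a frontier of scheduled links and a set of already seen links produce breadth-first or depth-first visiting orders.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
PAGES = {
    'a': ['b', 'c'],
    'b': ['d'],
    'c': ['d', 'a'],
    'd': [],
}

def crawl(start, pages, depth_first=False, limit=10):
    """Visit pages reachable from `start`, returning (url, referrer) pairs in visiting order."""
    frontier = deque([(start, None)])  # links scheduled for a visit
    seen = set([start])                # links that were already scheduled once
    visited = []
    while frontier and len(visited) < limit:
        # Depth-first takes the newest scheduled link, breadth-first the oldest.
        url, referrer = frontier.pop() if depth_first else frontier.popleft()
        visited.append((url, referrer))
        for link in pages.get(url, []):  # "parse" the page for new links
            if link not in seen:
                seen.add(link)
                frontier.append((link, url))
    return visited

bfs = crawl('a', PAGES)
dfs = crawl('a', PAGES, depth_first=True)
```

A real crawler additionally has to handle failed downloads and per-domain delays, which this sketch leaves out.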


## NLP module

Pattern provides an NLP toolkit for the English language, pattern.en, whose prominent features include a parser capable of tokenization, lemmatization and part-of-speech tagging, and a sentiment analyser. There are similar modules for other languages as well, such as pattern.es, pattern.de, pattern.fr and pattern.it.

### Parser:
Pattern's parser identifies words, sentences and parts of speech in text strings. This involves tokenization (breaking up the text into words and splitting punctuation marks from them), part-of-speech tagging (annotating words with their type, e.g., is "can" a noun or a verb?) and chunking (grouping word sequences that have some special meaning).

The parse() function takes a string, and accepts several parameters that decide its functionality as shown below.

In [5]:
from pattern.en import parse

string='The quick brown fox, "Qfox", jumped over the lazy dog, "Ldog".'
print parse(string,
        tokenize = True,         # Splits punctuation marks from words
        tags = True,             # Parses part-of-speech tags (NN, JJ, ...)
        chunks = True,           # Parses chunks (NP, VP, PNP, ...)
        relations = False,       # Parses chunk relations (-SBJ, -OBJ, ...)
        lemmata = False,         # Lemmatizes the words (ate => eat)
        encoding = 'utf-8',      # Input string encoding
        tagset = None)           # Tagset to use (None = default Penn Treebank II)


The/DT/B-NP/O quick/JJ/I-NP/O brown/JJ/I-NP/O fox/NN/I-NP/O ,/,/O/O "/"/O/O Qfox/UH/O/O "/"/O/O ,/,/O/O jumped/VBD/B-VP/O over/IN/B-PP/B-PNP the/DT/B-NP/I-PNP lazy/JJ/I-NP/I-PNP dog/NN/I-NP/I-PNP ,/,/O/O "/"/O/O Ldog/NN/B-NP/O "/"/O/O ././O/O


Common part-of-speech tags are NN (noun), VB (verb), JJ (adjective), RB (adverb) and IN (preposition).
Common chunk tags are NP (noun phrase) and VP (verb phrase).
Common chunk relations are NP-SBJ (subject) and NP-OBJ (object).
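
The slash-delimited output of parse() is easy to split back into its parts. The sketch below is a stand-alone illustration using plain string methods; `read_tagged` and the field names `word`, `tag`, `chunk` and `pnp` are labels chosen here, not identifiers from Pattern. It assumes the four-field layout produced by the settings used above (tags and chunks on, relations and lemmata off).

```python
def read_tagged(tagged, fields=('word', 'tag', 'chunk', 'pnp')):
    """Split a slash-delimited tagged string into one dict per token."""
    tokens = []
    for item in tagged.split(' '):
        # Split from the right so the word itself stays in one piece.
        parts = item.rsplit('/', len(fields) - 1)
        tokens.append(dict(zip(fields, parts)))
    return tokens

tagged = 'The/DT/B-NP/O quick/JJ/I-NP/O brown/JJ/I-NP/O fox/NN/I-NP/O jumped/VBD/B-VP/O'
tokens = read_tagged(tagged)

# Keep only the adjectives (tags starting with JJ).
adjectives = [t['word'] for t in tokens if t['tag'].startswith('JJ')]
```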

But as seen in the above example, the output is not very readable. A more convenient function is parsetree(), which stores a tagged string as a tree of nested objects that can be traversed to analyze the components of the string. It takes the same parameters as parse() and returns a Text object. A Text is a list of Sentence objects. Each Sentence is a list of Word objects. Word objects can be grouped in Chunk objects, which are related to other Chunk objects.

In [13]:
from pattern.en import parsetree

string='The quick brown fox, "Qfox", jumped over the lazy dog, "Ldog". Slow and steady wins the race, just ask the Tortoise.'
treestruct = parsetree(string, relations=True, lemmata=True)
print "String representation is: \n",repr(treestruct),"\n"
print "Tree representation is:"
for sentence in treestruct:
    for chunk in sentence.chunks:
        print chunk.type, [(w.string, w.type) for w in chunk.words]

String representation is: 
[Sentence('The/DT/B-NP/O/O/the quick/JJ/I-NP/O/O/quick brown/JJ/I-NP/O/O/brown fox/NN/I-NP/O/O/fox ,/,/O/O/O/, "/"/O/O/O/" Qfox/UH/O/O/O/qfox "/"/O/O/O/" ,/,/O/O/O/, jumped/VBD/B-VP/O/O/jump over/IN/B-PP/B-PNP/O/over the/DT/B-NP/I-PNP/O/the lazy/JJ/I-NP/I-PNP/O/lazy dog/NN/I-NP/I-PNP/O/dog ,/,/O/O/O/, "/"/O/O/O/" Ldog/NN/B-NP/O/O/ldog "/"/O/O/O/" ././O/O/O/.'), Sentence('Slow/JJ/B-ADJP/O/O/slow and/CC/O/O/O/and steady/RB/B-VP/O/VP-1/steady wins/VBZ/I-VP/O/VP-1/win the/DT/B-NP/O/NP-OBJ-1/the race/NN/I-NP/O/NP-OBJ-1/race ,/,/O/O/O/, ask/VB/B-VP/O/VP-2/ask the/DT/B-NP/O/NP-OBJ-2/the Tortoise/NNP/I-NP/O/NP-OBJ-2/tortoise ././O/O/O/.')] 

Tree representation is:
NP [(u'The', u'DT'), (u'quick', u'JJ'), (u'brown', u'JJ'), (u'fox', u'NN')]
VP [(u'jumped', u'VBD')]
PP [(u'over', u'IN')]
NP [(u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]
NP [(u'Ldog', u'NN')]
ADJP [(u'Slow', u'JJ')]
VP [(u'steady', u'RB'), (u'wins', u'VBZ')]
NP [(u'the', u'DT'), (u'race', u'NN')]


### Sentiment analysis:
The pattern.en module bundles a corpus of adjectives (e.g., good, bad, amazing, irritating, annoying, etc.) that occur frequently in reviews and social media text, annotated with scores for sentiment polarity (positive vs. negative) and subjectivity (objective vs. subjective). These scores are used to provide a sentiment analysis of user-provided strings.

The sentiment() function returns a (polarity, subjectivity)-tuple for the given sentence, based on the adjectives it contains, where polarity is a value between -1.0 (most negative sentiment) and +1.0 (most positive sentiment) and subjectivity is a value between 0.0 and 1.0. The sentence can be a string or a datatype returned from parsetree(), such as Text, Sentence, Chunk or Word. The returned value also has an assessments attribute, which provides the polarity and subjectivity score for each individual term or set of words that was evaluated.

The positive() function returns True if the given sentence's polarity is above the threshold. The threshold can be lowered or raised, but overall +0.1 gives the best results for product reviews as mentioned in the Pattern documentation.


In [21]:
from pattern.en import sentiment, positive
review = "Batman v Superman has its moments though with great performances from almost every one,but it was difficult to not walk out of this film and feel overwhelmingly disappointed."
sent = sentiment(review)
print sent
print "assessments: ", sent.assessments
# positive() is given the review text itself and compares its polarity with the threshold
print "judgement: ", positive(review, threshold=0.1)

(-0.15, 0.8333333333333334)
assessments:  [(['great'], 0.8, 0.75, None), (['difficult'], -0.5, 1.0, None), (['overwhelmingly', 'disappointed'], -0.75, 0.75, None)]
judgement:  False
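
The numbers in this output can be checked by hand. In this example the overall polarity and subjectivity equal the plain average of the three assessments; the sketch below reproduces that arithmetic with the values copied from the output. Whether Pattern always combines assessments as an unweighted mean is an assumption here, so treat this as a worked check of this one result rather than a description of the library's algorithm.

```python
# Assessments copied from the output above: (words, polarity, subjectivity).
assessments = [
    (['great'], 0.8, 0.75),
    (['difficult'], -0.5, 1.0),
    (['overwhelmingly', 'disappointed'], -0.75, 0.75),
]

def average(values):
    """Arithmetic mean, or 0.0 for an empty sequence."""
    values = list(values)
    return sum(values) / float(len(values)) if values else 0.0

polarity = average(p for _, p, _ in assessments)      # (0.8 - 0.5 - 0.75) / 3
subjectivity = average(s for _, _, s in assessments)  # (0.75 + 1.0 + 0.75) / 3
is_positive = polarity >= 0.1                         # same threshold as in the example
```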


## Vector module

The pattern.vector module contains machine learning tools, including bag-of-words document representations, latent semantic analysis, and clustering and classification algorithms. 

Pattern provides Document as a bag-of-words representation of a text, i.e., unordered words along with their word count. The Document.vector maps the words (or features) to their weight (word count, tf-idf, etc.). The weight of a word represents its relevancy in the text. Given an unlabeled document, a classifier yields the label of the most similar document(s) in its training set. 
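
To illustrate what a bag-of-words vector and its weights look like, here is a stand-alone sketch using only the standard library. The functions `bag_of_words`, `tf_idf` and `cosine` are hypothetical helpers written for this tutorial; the tf-idf formula used (relative frequency times log(N / document frequency)) is one common variant, and Pattern's own weighting and stop word handling may differ.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on non-letters and count each word."""
    return Counter(re.findall(r'[a-z]+', text.lower()))

def tf_idf(bags):
    """Weight each word by relative frequency times log(N / document frequency)."""
    n = len(bags)
    df = Counter(word for bag in bags for word in bag)  # number of documents containing each word
    vectors = []
    for bag in bags:
        total = float(sum(bag.values()))
        vectors.append(dict((w, (c / total) * math.log(n / float(df[w])))
                            for w, c in bag.items()))
    return vectors

def cosine(v1, v2):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v2.get(f, 0.0) for f, w in v1.items())
    norm = (math.sqrt(sum(w * w for w in v1.values())) *
            math.sqrt(sum(w * w for w in v2.values())))
    return dot / norm if norm else 0.0

texts = ['The cat purrs.', 'Curiosity killed the cat.', 'The dog wags his tail.']
bags = [bag_of_words(t) for t in texts]
vectors = tf_idf(bags)
```

A word such as "the" that occurs in every document gets weight 0, so the two cat sentences end up similar to each other while the dog sentence shares nothing relevant with them.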

### Classification:
Classification is a supervised machine learning method that uses labeled documents as training examples to statistically predict the label (class, type) of new documents, using the model built on the training examples. 
The pattern.vector module implements four classification algorithms:
- NB: Naive Bayes, based on the probability that a feature occurs in a class.
- KNN: k-nearest neighbor, based on the k most similar documents in the training set.
- SLP: single-layer averaged perceptron, based on an artificial neural network.
- SVM: support vector machine, based on a representation of the documents in a high-dimensional space separated by hyperplanes.

These classifiers can be defined as follows, and all of them inherit from the Classifier base class:



In [42]:
from pattern.vector import *
classifier =  NB(train=[], baseline=MAJORITY, method=MULTINOMIAL, alpha=0.0001)
classifier = KNN(train=[], baseline=MAJORITY, k=10, distance=COSINE)
classifier = SLP(train=[], baseline=MAJORITY, iterations=1)
classifier = SVM(train=[], type=CLASSIFICATION, kernel=LINEAR)

The Classifier base class provides three main methods: train(), classify() and test().

Classifier.train() trains the classifier with the given document and type (= class label). Note that train() is called once for each new training example provided to the classifier.

Classifier.classify() returns the label with the highest probability for the given input.

Classifier.test() returns an (accuracy, precision, recall, F1-score)-tuple.
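
For readers unfamiliar with these four metrics, the sketch below computes them from a list of gold labels and a list of predictions for a single target label. `evaluate` is a hypothetical helper, not Pattern code, and it does not cover how scores might be averaged when there are more than two classes.

```python
def evaluate(gold, predicted, target=True):
    """Return (accuracy, precision, recall, F1) with respect to one target label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)  # true positives
    fp = sum(1 for g, p in pairs if g != target and p == target)  # false positives
    fn = sum(1 for g, p in pairs if g == target and p != target)  # false negatives
    correct = sum(1 for g, p in pairs if g == p)
    accuracy = correct / float(len(pairs)) if pairs else 0.0
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

gold      = [True, True, False, False, True]
predicted = [True, False, False, True, True]
scores = evaluate(gold, predicted)
```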


In [65]:
from pattern.vector import Document, NB
from pattern.db import csv

nb = NB()

reviews = ["great action","best CGI ever!","waste of money", "could have been better","terrible acting"]
ratings = [4,5,3,2,1]

for review,rating in zip(reviews,ratings):
    featureVec=Document(review, type=int(rating), stopwords=True)
    nb.train(featureVec)

print nb.classes
print nb.classify(Document("terrible movie"))
print nb.classify(Document("one of the best movies of the year"))

[1, 2, 3, 4, 5]
1
4
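
To show the idea behind the NB classifier, here is a small stand-alone multinomial Naive Bayes with add-alpha smoothing. `ToyNaiveBayes` is a hypothetical class written for illustration, with simplified labels and whitespace tokenization; it is not Pattern's implementation and will not reproduce its exact scores.

```python
import math
from collections import Counter, defaultdict

class ToyNaiveBayes(object):
    """Multinomial Naive Bayes over word counts with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocabulary = set()

    def train(self, words, label):
        self.class_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocabulary.update(words)

    def classify(self, words):
        total_docs = float(sum(self.class_counts.values()))
        best, best_score = None, None
        for label in sorted(self.class_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocabulary)
            # log P(label) + sum of log P(word | label); words never seen in training are skipped
            score = math.log(self.class_counts[label] / total_docs)
            for w in words:
                if w in self.vocabulary:
                    score += math.log((counts[w] + self.alpha) / denom)
            if best_score is None or score > best_score:
                best, best_score = label, score
        return best

nb = ToyNaiveBayes()
for text, label in [('great action', 'pos'), ('best cgi ever', 'pos'),
                    ('waste of money', 'neg'), ('terrible acting', 'neg')]:
    nb.train(text.split(), label)
```

With only a handful of training examples, the prediction for a new review is driven entirely by whichever known words it happens to contain, which is also why the tiny example above should not be read as a meaningful rating predictor.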


### Latent Semantic Analysis:

Latent Semantic Analysis (LSA) is a statistical technique based on singular value decomposition ([SVD](https://en.wikipedia.org/wiki/Singular_value_decomposition)). It groups related features in the model into concepts (e.g., purr + fur + claw = feline concept). This is called dimensionality reduction. Each document in the model then gets a concept vector, a compressed approximation of the original vector that may be faster for cosine similarity, clustering and classification.

LSA.transform() takes a Document and returns its Vector in concept space. This is useful for documents that are not part of the model, and can basically work as an alternative to the Classifier.classify() method.
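
In matrix notation (the symbols here are assumptions chosen for illustration, not names from Pattern), let $A$ be the document-by-feature weight matrix of the model. A standard way to express the reduction to $k$ concepts is the truncated SVD:

$$A = U \Sigma V^{T}, \qquad A_k = U_k \Sigma_k V_k^{T}$$

where $U_k$, $\Sigma_k$ and $V_k$ keep only the parts belonging to the $k$ largest singular values. Each row of $V_k^{T}$ is a concept (a weighted combination of features), and each row of $U_k \Sigma_k$ is the concept vector of a document in the model. A new document with feature vector $q$ can be mapped into the same concept space as

$$\hat{q} = q V_k$$

which is consistent with the documents already in the model, since $A V_k = U \Sigma V^{T} V_k = U_k \Sigma_k$. How Pattern scales or normalizes these vectors internally is not covered here.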

The following example demonstrates how related features are grouped after LSA.


In [67]:
from pattern.vector import Document, Model

d1 = Document('The cat purrs.', name='cat1')
d2 = Document('Curiosity killed the cat.', name='cat2')
d3 = Document('The dog wags his tail.', name='dog1')
d4 = Document('The dog is happy.', name='dog2')

m = Model([d1, d2, d3, d4])
m.reduce(2)

for d in m.documents:
    print
    print d.name
    for concept, w1 in m.lsa.vectors[d.id].items():
        for feature, w2 in m.lsa.concepts[concept].items():
            if w1 != 0 and w2 != 0:
                print (feature, w1 * w2)


cat1
(u'cat', 0.4618802153517004)
(u'curiosity', 0.2309401076758502)
(u'purrs', 0.6928203230275505)
(u'killed', 0.2309401076758502)

cat2
(u'cat', 0.23094010767585008)
(u'curiosity', 0.11547005383792504)
(u'purrs', 0.34641016151377513)
(u'killed', 0.11547005383792504)

dog1
(u'wags', 0.11547005383792498)
(u'dog', 0.23094010767584996)
(u'tail', 0.11547005383792498)
(u'happy', 0.3464101615137754)

dog2
(u'wags', 0.23094010767585024)
(u'dog', 0.4618802153517005)
(u'tail', 0.23094010767585024)
(u'happy', 0.6928203230275516)


In the above example, the model is reduced to two dimensions. So there are two concepts in the concept space. Each document has a concept vector with weights for each concept. As shown, cat features have been grouped together and dog features have been grouped together.

### K-fold cross-validation:
K-fold cross-validation runs K tests on a given classifier, each time partitioning the dataset into different subsets for training and testing, and returns the average of the performance results across the K iterations. Because every document is used for both training and testing, the estimate is more reliable than one obtained from a single fixed split.

Pattern provides a function for this, which can be used as follows:

    kfoldcv(Classifier, documents=[], folds=10, target=None)

It returns a tuple of the average accuracy, precision, recall, F1-score and standard deviation. kfoldcv() also accepts any parameters of the given Classifier as optional parameters.
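
The partitioning itself can be sketched in a few lines. The example below is an illustration, not Pattern's kfoldcv(): `kfold_indices`, `cross_validate` and `majority_baseline` are hypothetical helpers, the folds are assigned by simple round-robin, and the "classifier" is just a majority-label baseline. Pattern may shuffle or partition the documents differently.

```python
def kfold_indices(n, folds):
    """Yield (train, test) index lists; item i is tested in fold i % folds."""
    for k in range(folds):
        test = [i for i in range(n) if i % folds == k]
        train = [i for i in range(n) if i % folds != k]
        yield train, test

def cross_validate(train_and_score, data, folds=5):
    """Average the score returned by train_and_score(train, test) over all folds."""
    scores = []
    for train_idx, test_idx in kfold_indices(len(data), folds):
        train = [data[i] for i in train_idx]
        test = [data[i] for i in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / float(len(scores))

def majority_baseline(train, test):
    """Hypothetical scorer: predict the most frequent training label, return accuracy."""
    labels = [label for _, label in train]
    guess = max(sorted(set(labels)), key=labels.count)
    return sum(1 for _, label in test if label == guess) / float(len(test))

# Twelve made-up (document, label) pairs; every third label is True.
data = [('doc%d' % i, i % 3 == 0) for i in range(12)]
mean_accuracy = cross_validate(majority_baseline, data, folds=4)
```

Any function with the same `(train, test) -> score` shape can be passed in place of the baseline, which is the same separation of concerns that lets kfoldcv() accept different Classifier classes.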

### Application: movie review classification
For example, if we have a corpus of movie reviews (training data) for which the rating is known (labels, within range 0-1), we can use it to predict the rating of other reviews, based on features extracted from the training data. 

We can improve on the simple implementation shown earlier by representing each review as a vector of adjectives (e.g., good, bad, awesome, awful, etc.), since positive reviews (good, awesome) will most likely contain different adjectives than negative reviews (bad, awful). Thus, we can use the part-of-speech tagger from the pattern.en module to get the adjectives from each review and build the model using them. We can also include nouns and verbs, as they form an essential part of the review.


In [70]:
from pattern.vector import NB, kfoldcv, count
from pattern.en import parsetree

def getFeatures(review):
    # Parse the first sentence of the review and keep the lemmata of
    # adjectives, nouns, verbs and exclamation marks as features.
    tree = parsetree(review, lemmata=True)[0]
    features = [w.lemma for w in tree if w.tag.startswith(('JJ', 'NN', 'VB', '!'))]
    return count(features)  # bag-of-words dict: lemma => count

with open('subj.txt') as f:
    reviews = f.readlines()
with open('rating.txt') as f:
    ratings = f.readlines()

# Each training example is a (features, label) tuple.
data = [(getFeatures(review), float(rating)) for review, rating in zip(reviews, ratings)]
print kfoldcv(NB, data)

(0.5113002922497332, 0.19196699556250568, 0.13615192244539878, 0.15834131073240393, 0.06052104170168932)


Another means of classification is to use the sentiment analysis results as a feature.

In [71]:
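from pattern.vector import NB, kfoldcv
from pattern.en import sentiment

def getFeatures(review):
    # sentiment() accepts a plain string; element [0] is the polarity.
    # The classifier expects a bag of features rather than a bare float, so the
    # polarity is rounded to one decimal and used as a single categorical
    # feature (an assumption made here to obtain a valid feature vector).
    polarity = sentiment(review)[0]
    return {'polarity=%.1f' % polarity: 1}

with open('subj.txt') as f:
    reviews = f.readlines()
with open('rating.txt') as f:
    ratings = f.readlines()

data = [(getFeatures(review), float(rating)) for review, rating in zip(reviews, ratings)]
print kfoldcv(NB, data)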

(0.5088543687158985, 0.1788190509087837, 0.13835569937934494, 0.15452916315976306, 0.034193001685803484)


## References
1. Pattern website: http://www.clips.ua.ac.be/pattern
2. Movie review dataset: https://www.cs.cornell.edu/people/pabo/movie-review-data/
