# Chapter 8. Sentiment Analyser Application for Movie Reviews
In this chapter, we describe an application that determines the sentiment of movie reviews using algorithms and methods described throughout the book. The Scrapy library is used to collect reviews from different websites, starting from the URLs returned by a search engine API (the Bing search engine). The text and the title of each movie review are extracted using the newspaper library or, if that fails, by following some predefined extraction rules on the HTML page. The sentiment of each review is determined using a naive Bayes classifier on the most informative words (selected with the χ² measure), in the same way as in Chapter 4, Web Mining Techniques. For completeness, the rank of each page related to a movie query is also calculated using the PageRank algorithm discussed in Chapter 4, Web Mining Techniques. This chapter discusses the code used to build the application, including the Django models and views and the Scrapy scraper used to collect data from the web pages of the movie reviews. We start by giving an example of what the web application looks like and explaining the search engine API used and how we include it in the application. We then describe how we collect the movie reviews, integrating the Scrapy library into Django, the models to store the data, and the main commands to manage the application. All the code discussed in this chapter is available in the GitHub repository of the author inside the chapter_8 folder at https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_8.

# Application usage overview
The home web page is as follows:

The user can type in a movie name to find out the sentiment and relevance of its reviews. For example, we look for Batman vs Superman Dawn of Justice in the following screenshot:

The application retrieves 18 review URLs from the Bing search engine, scrapes them using the Scrapy library, and analyzes their sentiment (15 positive and 3 negative). All data is stored in Django models, ready to be used to calculate the relevance of each page using the PageRank algorithm (the links at the bottom of the page as seen in the preceding screenshot). In this case, using the PageRank algorithm, we have the following:

This is a list of the pages most relevant to our movie review search, obtained by setting the depth parameter to 2 on the crawler (refer to the following sections for further details). Note that to have a good result on page relevance, you have to crawl thousands of pages (the preceding screenshot shows results for around 50 crawled pages).

To write the application, we start the server as usual (see Chapter 6, Getting Started with Django, and Chapter 7, Movie Recommendation System Web Application) and the main app in Django. First, we create a folder to store all our code, movie_reviews_analyzer_app, and then we initialize Django using the following commands:
```python
mkdir movie_reviews_analyzer_app
cd movie_reviews_analyzer_app
django-admin startproject webmining_server
cd webmining_server
python manage.py startapp pages
```
We configure the settings.py file as we did in the Settings section of Chapter 6, Getting Started with Django, and the Application Setup section of Chapter 7, Movie Recommendation System Web Application (of course, in this case the name is webmining_server instead of server_movierecsys).

The sentiment analyzer application has its main views in the views.py file in the main webmining_server folder instead of the app (pages) folder as we did previously (see Chapter 6, Getting Started with Django, and Chapter 7, Movie Recommendation System Web Application), because the functions now refer more to the general functioning of the server than to the specific app (pages).

The last operation to make the web service operational is to create a superuser account and go live with the server:
```python
python manage.py createsuperuser  # for example, admin/admin as username/password
python manage.py runserver
```
Now that the structure of the application has been explained, we can discuss the different parts in more detail starting from the search engine API used to collect URLs.

# Search engine choice and the application code

Since scraping directly from the most relevant search engines such as Google, Bing, Yahoo, and others is against their terms of service, we need to take the initial review pages from their REST API (using scraping services such as Crawlera, http://crawlera.com/, is also possible). We decided to use the Bing service, which allows 5,000 queries per month for free.

In order to do that, we register with the Microsoft service to obtain the key needed to allow the search. Briefly, we followed these steps:

- Register online on https://datamarket.azure.com.
- In My Account, take the Primary Account Key.
- Register a new application (under DEVELOPERS | REGISTER; put Redirect URI: https://www.bing.com).

After that, we can write a function that retrieves as many URLs relevant to our query as we want:
```python
import json
import urllib2

num_reviews = 30  # total number of review URLs to request
def bing_api(query):
    keyBing = API_KEY        # get Bing key from: https://datamarket.azure.com/account/keys
    credentialBing = 'Basic ' + (':%s' % keyBing).encode('base64')[:-1] # the "-1" is to remove the trailing "\n" which encode adds
    searchString = '%27X'+query.replace(" ",'+')+'movie+review%27'
    top = 50#maximum allowed by Bing
    
    reviews_urls = []
    if num_reviews<top:
        offset = 0
        url = 'https://api.datamarket.azure.com/Bing/Search/Web?' + \
              'Query=%s&$top=%d&$skip=%d&$format=json' % (searchString, num_reviews, offset)

        request = urllib2.Request(url)
        request.add_header('Authorization', credentialBing)
        requestOpener = urllib2.build_opener()
        response = requestOpener.open(request)
        results = json.load(response)
        reviews_urls = [ d['Url'] for d in results['d']['results']]
    else:
        nqueries = int(float(num_reviews)/top)+1
        for i in xrange(nqueries):
            offset = top*i
            if i==nqueries-1:
                top = num_reviews-offset
                url = 'https://api.datamarket.azure.com/Bing/Search/Web?' + \
                      'Query=%s&$top=%d&$skip=%d&$format=json' % (searchString, top, offset)

                request = urllib2.Request(url)
                request.add_header('Authorization', credentialBing)
                requestOpener = urllib2.build_opener()
                response = requestOpener.open(request) 
            else:
                top=50
                url = 'https://api.datamarket.azure.com/Bing/Search/Web?' + \
                      'Query=%s&$top=%d&$skip=%d&$format=json' % (searchString, top, offset)

                request = urllib2.Request(url)
                request.add_header('Authorization', credentialBing)
                requestOpener = urllib2.build_opener()
                response = requestOpener.open(request) 
            results = json.load(response)
            reviews_urls += [ d['Url'] for d in results['d']['results']]
    return reviews_urls
```
The API_KEY parameter is taken from the Microsoft account, query is a string which specifies the movie name, and num_reviews = 30 is the number of URLs returned in total from the Bing API. With the list of URLs that contain the reviews, we can now set up a scraper to extract from each web page the title and the review text using Scrapy.
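The paging arithmetic inside bing_api can be looked at separately from the HTTP calls. The following is a minimal sketch rather than the author's code: paging_plan is a hypothetical helper that returns the ($top, $skip) pairs needed to collect a given number of results when a single request returns at most 50 of them. It is written for Python 3, whereas the chapter listings target Python 2.

```python
# Python 3 sketch; paging_plan is a hypothetical helper, not part of the application.
def paging_plan(num_results, page_size=50):
    """Return the (top, skip) pairs that cover num_results in pages of at most page_size."""
    plan = []
    skip = 0
    while skip < num_results:
        # The last page may be shorter than page_size.
        top = min(page_size, num_results - skip)
        plan.append((top, skip))
        skip += top
    return plan


print(paging_plan(30))   # one request is enough
print(paging_plan(120))  # two full pages and a shorter final one
```

Computed this way, a total that is an exact multiple of 50 does not produce a final request with $top=0, which the loop in bing_api appears to issue in that case.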

# Scrapy setup and the application code
Scrapy is a Python library used to extract content from web pages or to crawl pages linked to a given web page (see the Web crawlers (or spiders) section of Chapter 4, Web Mining Techniques, for more details). To install the library, type the following in the terminal:
```python
sudo pip install Scrapy 
```
Install the executable in the bin folder:
```python
sudo easy_install scrapy
```
From the movie_reviews_analyzer_app folder, we initialize our Scrapy project as follows:
```python
scrapy startproject scrapy_spider
```
This command will create the following tree inside the scrapy_spider folder:
```python
├── __init__.py
├── items.py
├── pipelines.py
├── settings.py
├── spiders
│   ├── __init__.py
```
The pipelines.py and items.py files manage how the scraped data is stored and manipulated, and they will be discussed later in the Spiders and Integrate Django with Scrapy sections. The settings.py file sets the parameters each spider (or crawler) defined in the spiders folder uses to operate. In the following two sections, we describe the main parameters and spiders used in this application.

# Scrapy settings
The settings.py file collects all the parameters used by each spider in the Scrapy project to scrape web pages. The main parameters are as follows:

- DEPTH_LIMIT: The maximum depth (number of link hops from an initial URL) that the crawler is allowed to follow. The default is 0, which means that no limit is set.
- LOG_ENABLED: Whether Scrapy logs to the terminal while executing; the default is True.
- ITEM_PIPELINES = {'scrapy_spider.pipelines.ReviewPipeline': 1000,}: The path of the pipeline function to manipulate data extracted from each web page.
- CONCURRENT_ITEMS = 200: The number of concurrent items processed in the pipeline.
- CONCURRENT_REQUESTS = 5000: The maximum number of simultaneous requests handled by Scrapy.
- CONCURRENT_REQUESTS_PER_DOMAIN = 3000: The maximum number of simultaneous requests handled by Scrapy for each specified domain.

The larger the depth, the more pages are scraped and, consequently, the longer the scraping takes. To speed up the process, you can set high values for the last three parameters. In this application (the spiders folder), we define two spiders: a scraper to extract data from each movie review URL (movie_link_results.py) and a crawler to generate a graph of web pages linked to the initial movie review URL (recursive_link_results.py).

# Scraper
The scraper on movie_link_results.py looks as follows:
```python
from newspaper import Article
from urlparse import urlparse
from scrapy.selector import Selector
from scrapy import Spider
from scrapy.spiders import BaseSpider,CrawlSpider, Rule
from scrapy.http import Request
from scrapy_spider import settings
from scrapy_spider.items import PageItem,SearchItem

unwanted_domains = ['youtube.com','www.youtube.com']
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))

def CheckQueryinReview(keywords,title,content):
    content_list = map(lambda x:x.lower(),content.split(' '))
    title_list = map(lambda x:x.lower(),title.split(' '))
    words = content_list+title_list
    for k in keywords:
        if k in words:
            return True
    return False

class Search(Spider):
    name = 'scrapy_spider_reviews'
    
    def __init__(self,url_list,search_key):#specified by -a
        self.search_key = search_key
        self.keywords = [w.lower() for w in search_key.split(" ") if w not in stopwords]
        self.start_urls =url_list.split(',')
        super(Search, self).__init__(url_list)
    
    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_site,dont_filter=True)
                        
    def parse_site(self, response):
        ## Get the selector for xpath parsing or from newspaper
        
        def crop_emptyel(arr):
            return [u for u in arr if u!=' ']
        
        domain = urlparse(response.url).hostname
        a = Article(response.url)
        a.download()
        a.parse()
        title = a.title.encode('ascii','ignore').replace('\n','')
        sel = Selector(response)
        if not title:  # newspaper found no title: fall back to the <title> tag
            title = sel.xpath('//title/text()').extract()
            if len(title)>0:
                title = title[0].encode('utf-8').strip().lower()
                
        content = a.text.encode('ascii','ignore').replace('\n','')
        if not content:  # newspaper found no text: fall back to the xpath rules
            content = 'none'
            if len(crop_emptyel(sel.xpath('//div//article//p/text()').extract()))>1:
                contents = crop_emptyel(sel.xpath('//div//article//p/text()').extract())
                print 'divarticle'
            ….
            elif len(crop_emptyel(sel.xpath('/html/head/meta[@name="description"]/@content').extract()))>0:
                contents = crop_emptyel(sel.xpath('/html/head/meta[@name="description"]/@content').extract())
            content = ' '.join([c.encode('utf-8') for c in contents]).strip().lower()
                
        #get search item 
        search_item = SearchItem.django_model.objects.get(term=self.search_key)
        #save item
        if not PageItem.django_model.objects.filter(url=response.url).exists():
            if len(content) > 0:
                if CheckQueryinReview(self.keywords,title,content):
                    if domain not in unwanted_domains:
                        newpage = PageItem()
                        newpage['searchterm'] = search_item
                        newpage['title'] = title
                        newpage['content'] = content
                        newpage['url'] = response.url
                        newpage['depth'] = 0
                        newpage['review'] = True
                        #newpage.save()
                        return newpage  
        else:
            return None
```
The Search class inherits from the Spider class of Scrapy. It overrides or adds the following methods, and relies on one helper function:

- __init__: The constructor of the spider needs to define the start_urls list that contains the URL to extract content from. In addition, we have custom variables such as search_key and keywords that store the information related to the query of the movie's title used on the search engine API.
- start_requests: This function is triggered when the spider is called, and it declares what to do for each URL in the start_urls list; for each URL, the custom parse_site function will be called (instead of the default parse function).
- parse_site: It is a custom function to parse data from each URL. To extract the title of the review and its text content, we used the newspaper library (sudo pip install newspaper) or, if it fails, we parse the HTML file directly using some defined rules to avoid the noise due to undesired tags (each rule structure is defined with the sel.xpath command). To achieve this result, we select some popular domains (rottentomatoes, cnn, and so on) and ensure the parsing is able to extract the content from these websites (not all the extraction rules are displayed in the preceding code but they can be found as usual in the GitHub file). The data is then stored in a page Django model using the related Scrapy item and the ReviewPipeline function (see the following section).
- CheckQueryinReview: This is a custom function to check whether the movie title (from the query) is contained in the content or title of each web page.
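The keyword filter can be tried in isolation. The following is a minimal sketch, written for Python 3 rather than the Python 2 of the listing; the function names and the small STOPWORDS set are assumptions made here so that the example does not depend on NLTK. Unlike the constructor above, it lowercases each query word before comparing it with the stopword list.

```python
# Python 3 sketch. STOPWORDS is a tiny stand-in for NLTK's English stopword list (assumption).
STOPWORDS = {'a', 'an', 'the', 'of', 'and', 'in', 'on'}


def query_keywords(search_key, stopwords=STOPWORDS):
    """Lowercase the query and drop the stopwords."""
    return [w.lower() for w in search_key.split() if w.lower() not in stopwords]


def check_query_in_review(keywords, title, content):
    """Return True if at least one keyword appears as a whole word in the title or content."""
    words = set(w.lower() for w in (title + ' ' + content).split())
    return any(k in words for k in keywords)


keywords = query_keywords('Batman vs Superman Dawn of Justice')
print(keywords)
print(check_query_in_review(keywords, 'Review: Batman returns', 'A dark film.'))
print(check_query_in_review(keywords, 'Cooking tips', 'How to bake bread.'))
```

As in the original function, words are split on whitespace only, so a keyword followed by punctuation (for example "batman,") is not matched.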


To run the spider, we need to type in the following command from the scrapy_spider (internal) folder:
```python
scrapy crawl scrapy_spider_reviews -a url_list=listname -a search_key=keyname
```
# Pipelines

The pipelines define what to do when a new page is scraped by the spider. In the preceding case, the parse_site function returns a PageItem object, which triggers the following pipeline (pipelines.py):
```python
class ReviewPipeline(object):
    def process_item(self, item, spider):
        #if spider.name == 'scrapy_spider_reviews':#not working
        item.save()
        return item
```
This class simply saves each item (a new page in the spider notation).

# Crawler
As we showed in the overview (the preceding section), the relevance of the review is calculated using the PageRank algorithm after we have stored all the linked pages starting from the review's URL. The crawler recursive_link_results.py performs this operation:
```python
#from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request

from scrapy_spider.items import PageItem,LinkItem,SearchItem

class Search(CrawlSpider):
    name = 'scrapy_spider_recursive'
    
    def __init__(self,url_list,search_id):#specified by -a
    
        #REMARK: if allowed_domains is not set then ALL domains are allowed
        self.start_urls = url_list.split(',')
        self.search_id = int(search_id)
        
        #allow any link but the ones with different font size(repetitions)
        self.rules = (
            Rule(LinkExtractor(allow=(),deny=('fontSize=*','infoid=*','SortBy=*', ),unique=True), callback='parse_item', follow=True), 
            )
        super(Search, self).__init__(url_list)

    def parse_item(self, response):
        sel = Selector(response)
        
        ## Get meta info from website
        title = sel.xpath('//title/text()').extract()
        if len(title)>0:
            title = title[0].encode('utf-8')
            
        contents = sel.xpath('/html/head/meta[@name="description"]/@content').extract()
        content = ' '.join([c.encode('utf-8') for c in contents]).strip()

        fromurl = response.request.headers['Referer']
        tourl = response.url
        depth = response.request.meta['depth']
        
        #get search item 
        search_item = SearchItem.django_model.objects.get(id=self.search_id)
        #newpage
        if not PageItem.django_model.objects.filter(url=tourl).exists():
            newpage = PageItem()
            newpage['searchterm'] = search_item
            newpage['title'] = title
            newpage['content'] = content
            newpage['url'] = tourl
            newpage['depth'] = depth
            newpage.save()#cant use pipeline cause the execution can finish here
        
        #get from_id,to_id
        from_page = PageItem.django_model.objects.get(url=fromurl)
        from_id = from_page.id
        to_page = PageItem.django_model.objects.get(url=tourl)
        to_id = to_page.id
        
        #newlink
        if not LinkItem.django_model.objects.filter(from_id=from_id).filter(to_id=to_id).exists():
            newlink = LinkItem()
            newlink['searchterm'] = search_item
            newlink['from_id'] = from_id
            newlink['to_id'] = to_id
            newlink.save()
```
As in the scraper case, the Search class inherits from a Scrapy class (here CrawlSpider), and the following methods are defined:

- __init__: This is the constructor of the class. The start_urls parameter defines the starting URLs from which the spider will start to crawl until the DEPTH_LIMIT value is reached. The rules parameter sets the type of URL allowed/denied to scrape (in this case, the same page but with different font sizes is disregarded) and it defines the function to call to manipulate each retrieved page (parse_item). Also, a custom variable search_id is defined, which is needed to store the ID of the query together with the other data.
- parse_item: This is a custom function called to store the important data from each retrieved page. A new Django item of the Page model (see the following section) is created from each page, which contains the title and content of the page (using the xpath HTML parser). To perform the PageRank algorithm, each connection between a retrieved page and the page that links to it is saved as an object of the Link model using the related Scrapy item (see the following sections).
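The (from_id, to_id) pairs saved by parse_item are the input a PageRank computation needs. The following is a minimal sketch of the iterative algorithm on such pairs, written for Python 3; it is not the application's implementation, and the damping factor of 0.85, the tolerance, and the handling of pages without outgoing links are assumptions chosen for illustration.

```python
# Python 3 sketch of PageRank on (from_id, to_id) pairs; parameter values are assumptions.
def pagerank(links, damping=0.85, tol=1e-6, max_iter=100):
    """Return {page_id: rank} for a graph given as (from_id, to_id) pairs."""
    links = list(links)
    nodes = sorted({n for pair in links for n in pair})
    out_links = {n: set() for n in nodes}
    for src, dst in links:
        if src != dst:  # ignore self-links
            out_links[src].add(dst)

    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(max_iter):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in nodes if not out_links[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        delta = sum(abs(new_rank[p] - rank[p]) for p in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy graph: pages 1, 2, 3 form a cycle and page 4 links to page 1.
ranks = pagerank([(1, 2), (2, 3), (3, 1), (4, 1)])
print(sorted(ranks, key=ranks.get, reverse=True))
```

In the application, the pairs would come from the Link objects of a given search term, and the old_rank and new_rank fields of the Page model suggest that the two rank vectors of an iteration of this kind are stored per page.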


To run the crawler, we need to type the following from the (internal) scrapy_spider folder:
```python
scrapy crawl scrapy_spider_recursive -a url_list=listname -a search_id=keyname
```

# Django models
The data collected using the spiders needs to be stored in a database. In Django, the database tables are called models and defined in the models.py file (within the pages folder). The content of this file is as follows:
```python
from django.db import models
from django.conf import settings
from django.utils.translation import ugettext_lazy as _

class SearchTerm(models.Model):
    term = models.CharField(_('search'), max_length=255)
    num_reviews = models.IntegerField(null=True,default=0)
    #display term on admin panel
    def __unicode__(self):
            return self.term

class Page(models.Model):
     searchterm = models.ForeignKey(SearchTerm, related_name='pages',null=True,blank=True)
     url = models.URLField(_('url'), default='', blank=True)
     title = models.CharField(_('name'), max_length=255)
     depth = models.IntegerField(null=True,default=-1)
     html = models.TextField(_('html'),blank=True, default='')
     review = models.BooleanField(default=False)
     old_rank = models.FloatField(null=True,default=0)
     new_rank = models.FloatField(null=True,default=1)
     content = models.TextField(_('content'),blank=True, default='')
     sentiment = models.IntegerField(null=True,default=100)
     
class Link(models.Model):
     searchterm = models.ForeignKey(SearchTerm, related_name='links',null=True,blank=True)
     from_id = models.IntegerField(null=True)
     to_id = models.IntegerField(null=True)
```
Each movie title typed on the home page of the application is stored in the SearchTerm model, while the data of each web page is collected in an object of the Page model. Apart from the content fields (html, title, url, content), the sentiment of the review and the depth in the graph network are recorded (a Boolean also indicates whether the web page is a movie review page or simply a linked page). The Link model stores all the graph links between pages, which are then used by the PageRank algorithm to calculate the relevance of the review web pages. Note that the Page model and the Link model are both linked to the related SearchTerm through a foreign key. As usual, to write these models as database tables, we type the following commands:
```python
python manage.py makemigrations
python manage.py migrate
```
To populate these Django models, we need to make Scrapy interact with Django, and this is the subject of the following section.

# Integrating Django with Scrapy

To make paths easy to call, we remove the external scrapy_spider folder so that inside the movie_reviews_analyzer_app, the webmining_server folder is at the same level as the scrapy_spider folder:
```python
├── db.sqlite3
├── scrapy.cfg
├── scrapy_spider
│   ├── ...
│   ├── spiders
│   │   ...
└── webmining_server
```
We set the Django path into the Scrapy settings.py file:
```python
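# Setting up django's project full path.
import os
import sys
# Assumption: BASE_DIR is the movie_reviews_analyzer_app folder (two levels above this settings.py).
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.insert(0, BASE_DIR+'/webmining_server')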
# Setting up django's settings module name.
os.environ['DJANGO_SETTINGS_MODULE'] = 'webmining_server.settings'
#import django to load models(otherwise AppRegistryNotReady: Models aren't loaded yet):
import django
django.setup()
```
Now we can install the library that will allow managing Django models from Scrapy:
```python
sudo pip install scrapy-djangoitem
```
In the items.py file, we write the links between Django models and Scrapy items as follows:
```python
from scrapy_djangoitem import DjangoItem
from pages.models import Page,Link,SearchTerm

class SearchItem(DjangoItem):
    django_model = SearchTerm
class PageItem(DjangoItem):
    django_model = Page
class LinkItem(DjangoItem):
    django_model = Link
```
Each class inherits from the DjangoItem class so that the original Django models declared with the django_model variable are automatically linked. The Scrapy project is now complete, so we can continue our discussion by explaining the Django code that handles the data extracted by Scrapy and the Django commands needed to manage the application.

# Commands (sentiment analysis model and delete queries)
The application needs to manage some operations that are not available to the end user of the service, such as defining a sentiment analysis model and deleting a movie query in order to redo it instead of retrieving the existing stored data. The following sections will explain the commands to perform these actions.

# Sentiment analysis model loader
The final goal of this application is to determine the sentiment (positive or negative) of the movie reviews. To achieve that, a sentiment classifier must be built using some external data, and then it should be stored in memory (cache) to be used by each query request. This is the purpose of the load_sentimentclassifier.py command displayed hereafter:

```python
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist
import collections
from django.core.management.base import BaseCommand, CommandError
from optparse import make_option
from django.core.cache import cache

stopwords = set(stopwords.words('english'))
method_selfeatures = 'best_words_features'

class Command(BaseCommand):
    option_list = BaseCommand.option_list + (
                make_option('-n', '--num_bestwords',
                             dest='num_bestwords', type='int',
                             action='store',
                             help=('number of words with high information')),)
    
    def handle(self, *args, **options):
         num_bestwords = options['num_bestwords']
         self.bestwords = self.GetHighInformationWordsChi(num_bestwords)
         clf = self.train_clf(method_selfeatures)
         cache.set('clf',clf)
         cache.set('bestwords',self.bestwords)
```
At the beginning of the file, the variable method_selfeatures sets the method of feature selection (in this case, the features are the words in the reviews; see Chapter 4, Web Mining Techniques, for further details) used to train the classifier train_clf. The maximum number of best words (features) is defined by the input parameter num_bestwords. The classifier and the best features (bestwords) are then stored in the cache ready to be used by the application (using the cache module). The classifier and the methods to select the best words (features) are as follows:
```python
    def train_clf(self, method):
        negidxs = movie_reviews.fileids('neg')
        posidxs = movie_reviews.fileids('pos')
        #pick the feature extractor matching the chosen selection method
        if method=='stopword_filtered_words_features':
            extract = self.stopword_filtered_words_features
        elif method=='best_words_features':
            extract = self.best_words_features
        elif method=='best_bigrams_words_features':
            extract = self.best_bigrams_word_features
        negfeatures = [(extract(movie_reviews.words(fileids=[f])), 'neg') for f in negidxs]
        posfeatures = [(extract(movie_reviews.words(fileids=[f])), 'pos') for f in posidxs]

        trainfeatures = negfeatures + posfeatures
        clf = NaiveBayesClassifier.train(trainfeatures)
        return clf

    def stopword_filtered_words_features(self,words):
        return dict([(word, True) for word in words if word not in stopwords])

    #eliminate Low Information Features
    def GetHighInformationWordsChi(self,num_bestwords):
        word_fd = FreqDist()
        label_word_fd = ConditionalFreqDist()

        for word in movie_reviews.words(categories=['pos']):
            word_fd[word.lower()] +=1
            label_word_fd['pos'][word.lower()] +=1

        for word in movie_reviews.words(categories=['neg']):
            word_fd[word.lower()] +=1
            label_word_fd['neg'][word.lower()] +=1

        pos_word_count = label_word_fd['pos'].N()
        neg_word_count = label_word_fd['neg'].N()
        total_word_count = pos_word_count + neg_word_count

        word_scores = {}
        for word, freq in word_fd.iteritems():
            pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
                (freq, pos_word_count), total_word_count)
            neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
                (freq, neg_word_count), total_word_count)
            word_scores[word] = pos_score + neg_score

        best = sorted(word_scores.iteritems(), key=lambda (w,s): s, reverse=True)[:num_bestwords]
        bestwords = set([w for w, s in best])
        return bestwords

    def best_words_features(self,words):
        return dict([(word, True) for word in words if word in self.bestwords])
    
    def best_bigrams_word_features(self,words, measure=BigramAssocMeasures.chi_sq, nbigrams=200):
        bigram_finder = BigramCollocationFinder.from_words(words)
        bigrams = bigram_finder.nbest(measure, nbigrams)
        d = dict([(bigram, True) for bigram in bigrams])
        d.update(self.best_words_features(words))
        return d
```
Three methods are written to select words in the preceding code:

- stopword_filtered_words_features: Eliminates the stopwords using the Natural Language Toolkit (NLTK) English stopword list and considers the rest as relevant words
- best_words_features: Using the χ² measure (NLTK library), the most informative words related to positive or negative reviews are selected (see Chapter 4, Web Mining Techniques, for further details)
- best_bigrams_word_features: Uses the χ² measure (NLTK library) to find the 200 most informative bigrams from the set of words and adds them to the words selected by best_words_features (see Chapter 4, Web Mining Techniques, for further details)
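To make the word scoring concrete, the χ² statistic for one word and one class can be computed by hand from a 2x2 contingency table. The following Python 3 sketch mirrors, as far as we understand it, the quantity that BigramAssocMeasures.chi_sq returns for the arguments used in GetHighInformationWordsChi; the chi_sq and word_score functions and the toy counts are written here for illustration and are not NLTK code.

```python
# Python 3 sketch; the counts in the examples are invented toy numbers.
def chi_sq(n_ii, n_ix, n_xi, n_xx):
    """Chi-square statistic of a 2x2 table.

    n_ii: occurrences of the word inside the class
    n_ix: occurrences of the word in the whole corpus
    n_xi: number of words in the class
    n_xx: number of words in the whole corpus
    """
    n_io = n_ix - n_ii                # the word, in the other class
    n_oi = n_xi - n_ii                # other words, in this class
    n_oo = n_xx - n_ix - n_xi + n_ii  # other words, in the other class
    num = n_xx * (n_ii * n_oo - n_io * n_oi) ** 2
    den = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    return num / den if den else 0.0


def word_score(pos_count, neg_count, pos_total, neg_total):
    """Sum of the scores against both classes, as in GetHighInformationWordsChi."""
    freq = pos_count + neg_count
    total = pos_total + neg_total
    return (chi_sq(pos_count, freq, pos_total, total)
            + chi_sq(neg_count, freq, neg_total, total))


# A word seen 9 times in negative reviews and once in positive ones is informative...
print(word_score(pos_count=1, neg_count=9, pos_total=100, neg_total=100))
# ...while a word spread evenly over both classes scores zero.
print(word_score(pos_count=10, neg_count=10, pos_total=100, neg_total=100))
```

Words are then sorted by this score and the top num_bestwords are kept as features.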


The chosen classifier is the Naive Bayes algorithm (see Chapter 3, Supervised Machine Learning), and the labeled text (positive or negative sentiment) is taken from the movie_reviews corpus in nltk.corpus. To install it, open a Python interpreter and download the movie_reviews corpus:
```python
import nltk
nltk.download('movie_reviews')  # or run nltk.download() and select corpora/movie_reviews
```
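To illustrate what training on (features, label) pairs amounts to, the following Python 3 sketch implements a very small naive Bayes classifier on word-presence features. It is a simplified stand-in, not NLTK's NaiveBayesClassifier: the function names, the add-one smoothing, and the four toy reviews are assumptions made for this example.

```python
# Python 3 sketch of naive Bayes on {word: True} features; not NLTK's implementation.
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_features):
    """labeled_features: list of ({word: True, ...}, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in labeled_features:
        label_counts[label] += 1
        for word in features:
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(model, features):
    """Return the label with the highest log posterior for the given features."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the label
        for word in features:
            if word in vocabulary:  # unseen words carry no information
                # P(word present | label) with add-one smoothing
                score += math.log((word_counts[label][word] + 1) / (n_docs + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


train = [
    ({'great': True, 'fun': True}, 'pos'),
    ({'great': True}, 'pos'),
    ({'awful': True}, 'neg'),
    ({'boring': True, 'awful': True}, 'neg'),
]
model = train_naive_bayes(train)
print(classify(model, {'great': True}))
print(classify(model, {'awful': True, 'boring': True}))
```

In the application, the feature dictionaries are produced by one of the three selection methods above, and the trained classifier is the object stored in the cache under the clf key.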
# Deleting an already performed query
Since we can specify different parameters (such as the feature selection method, the number of best words, and so on), we may want to perform and store again the sentiment of the reviews with different values. The delete_query command is needed for this purpose and it is as follows:
```python
from pages.models import Link,Page,SearchTerm
from django.core.management.base import BaseCommand, CommandError
from optparse import make_option

class Command(BaseCommand):
    option_list = BaseCommand.option_list + (
                make_option('-s', '--searchid',
                             dest='searchid', type='int',
                             action='store',
                             help=('id of the search term to delete')),)

    def handle(self, *args, **options):
         searchid = options['searchid']
         if searchid is None:
             print "please specify searchid: python manage.py delete_query --searchid=ID"
             #list
             for sobj in SearchTerm.objects.all():
                 print 'id:',sobj.id,"  term:",sobj.term
         else:
             print 'delete...'
             search_obj = SearchTerm.objects.get(id=searchid)
             pages = search_obj.pages.all()
             pages.delete()
             links = search_obj.links.all()
             links.delete()
             search_obj.delete()
```
If we run the command without specifying the searchid (the ID of the query), the list of all the queries and related IDs will be shown. After that we can choose which query we want to delete by typing the following:
```python
python manage.py delete_query --searchid=VALUE
```

We can use the cached sentiment analysis model to show the user the online sentiment of the chosen movie, as we explain in the following section.

# Sentiment reviews analyser – Django views and HTML

Most of the code explained in this chapter (commands, Bing search engine, Scrapy, and Django models) is used in the analyzer function in views.py to power the home page shown in the Application usage overview section. The URL is first declared in the urls.py file:
```python
url(r'^','webmining_server.views.analyzer')
```
The view itself is as follows:

```python
def analyzer(request):
    context = {}

    if request.method == 'POST':
        post_data = request.POST
        query = post_data.get('query', None)
        if query:
            return redirect('%s?%s' % (reverse('webmining_server.views.analyzer'),
                                urllib.urlencode({'q': query})))   
    elif request.method == 'GET':
        get_data = request.GET
        query = get_data.get('q')
        if not query:
            return render_to_response(
                'movie_reviews/home.html', RequestContext(request, context))

        context['query'] = query
        stripped_query = query.strip().lower()
        urls = []
        
        if test_mode:
           urls = parse_bing_results()
        else:
           urls = bing_api(stripped_query)
           
        if len(urls)== 0:
           return render_to_response(
               'movie_reviews/noreviewsfound.html', RequestContext(request, context))
        if not SearchTerm.objects.filter(term=stripped_query).exists():
           s = SearchTerm(term=stripped_query)
           s.save()
           try:
               #scrape
               cmd = 'cd ../scrapy_spider & scrapy crawl scrapy_spider_reviews -a url_list=%s -a search_key=%s' %('\"'+str(','.join(urls[:num_reviews]).encode('utf-8'))+'\"','\"'+str(stripped_query)+'\"')
               os.system(cmd)
           except:
               print 'error!'
               s.delete()
        else:
           #collect the pages already scraped 
           s = SearchTerm.objects.get(term=stripped_query)
           
        #calc num pages
        pages = s.pages.all().filter(review=True)
        if len(pages) == 0:
           s.delete()
           return render_to_response(
               'movie_reviews/noreviewsfound.html', RequestContext(request, context))
               
        s.num_reviews = len(pages)
        s.save()
         
        context['searchterm_id'] = int(s.id)

        #train classifier with nltk
        def train_clf(method):
            ...           
        def stopword_filtered_words_features(words):
            ... 
        #Eliminate Low Information Features
        def GetHighInformationWordsChi(num_bestwords):
            ...            
        bestwords = cache.get('bestwords')
        if bestwords == None:
            bestwords = GetHighInformationWordsChi(num_bestwords)
        def best_words_features(words):
            ...       
        def best_bigrams_words_features(words, measure=BigramAssocMeasures.chi_sq, nbigrams=200):
            ...
        clf = cache.get('clf')
        if clf == None:
            clf = train_clf(method_selfeatures)

        cntpos = 0
        cntneg = 0
        for p in pages:
            words = p.content.split(" ")
            feats = best_words_features(words)#bigram_word_features(words)#stopword_filtered_word_feats(words)
            #print feats
            str_sent = clf.classify(feats)
            if str_sent == 'pos':
               p.sentiment = 1
               cntpos +=1
            else:
               p.sentiment = -1
               cntneg +=1
            p.save()

        context['reviews_classified'] = len(pages)
        context['positive_count'] = cntpos
        context['negative_count'] = cntneg
        context['classified_information'] = True
    return render_to_response(
        'movie_reviews/home.html', RequestContext(request, context))
```
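The last part of the view turns each review into a dictionary of word features, classifies it, and tallies the labels. The sketch below reproduces that loop in Python 3 with a small hand-written naive Bayes classifier in place of the NLTK one. The training sentences, the bestwords set, and the function signatures are invented for the example, and the add-one smoothing is a simplification chosen for the sketch rather than a description of NLTK's estimator.

```python
import math
from collections import Counter, defaultdict


def best_words_features(words, bestwords):
    """Keep only the informative words, as presence features."""
    return {word: True for word in words if word in bestwords}


def train_naive_bayes(labelled_feats, vocabulary):
    """Return, per label, the log-prior and smoothed word log-likelihoods."""
    label_counts = Counter(label for _, label in labelled_feats)
    word_counts = defaultdict(Counter)
    for feats, label in labelled_feats:
        word_counts[label].update(list(feats))  # one count per document
    total = sum(label_counts.values())
    model = {}
    for label, n_docs in label_counts.items():
        prior = math.log(n_docs / total)
        # Add-one smoothing over the "word present in a document" events.
        likelihood = {w: math.log((word_counts[label][w] + 1) / (n_docs + 2))
                      for w in vocabulary}
        model[label] = (prior, likelihood)
    return model


def classify(model, feats):
    """Pick the label with the highest log-posterior for the present words."""
    def score(label):
        prior, likelihood = model[label]
        return prior + sum(likelihood[w] for w in feats if w in likelihood)
    return max(sorted(model), key=score)


bestwords = {'great', 'moving', 'boring', 'dull'}
training = [
    ("a great and moving film".split(), 'pos'),
    ("great cast great script".split(), 'pos'),
    ("a dull and boring film".split(), 'neg'),
    ("boring plot weak script".split(), 'neg'),
]
train_feats = [(best_words_features(w, bestwords), label)
               for w, label in training]
clf = train_naive_bayes(train_feats, bestwords)

# Same tallying loop as in the view, applied to made-up review texts.
reviews = ["what a great film", "dull and boring", "a moving story"]
cntpos = cntneg = 0
for content in reviews:
    feats = best_words_features(content.split(" "), bestwords)
    if classify(clf, feats) == 'pos':
        cntpos += 1
    else:
        cntneg += 1
```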
The movie title entered by the user is stored in the query variable and sent to the bing_api function to collect the URLs of the reviews. The URLs are then scraped by calling Scrapy to find the review texts, which are processed using the clf classifier model and the selected most informative words (bestwords) retrieved from the cache (or generated again if the cache is empty). The counts of the predicted sentiments of the reviews (positive_count, negative_count, and reviews_classified) are then sent back to the home.html page (in the templates folder), which uses the following Google pie chart code:
```python
        <h2 align = Center>Movie Reviews Sentiment Analysis</h2>
        <div class="row">
        <p align = Center><strong>Reviews Classified : {{ reviews_classified }}</strong></p>
        <p align = Center><strong>Positive Reviews : {{ positive_count }}</strong></p>
        <p align = Center><strong> Negative Reviews : {{ negative_count }}</strong></p>
        </div> 
  <section>
      <script type="text/javascript" src="https://www.google.com/jsapi"></script>
      <script type="text/javascript">
        google.load("visualization", "1", {packages:["corechart"]});
        google.setOnLoadCallback(drawChart);
        function drawChart() {
          var data = google.visualization.arrayToDataTable([
            ['Sentiment', 'Number'],
            ['Positive',     {{ positive_count }}],
            ['Negative',      {{ negative_count }}]
          ]);
          var options = { title: 'Sentiment Pie Chart'};
          var chart = new google.visualization.PieChart(document.getElementById('piechart'));
          chart.draw(data, options);
        }
      </script>
        <p align ="Center" id="piechart" style="width: 900px; height: 500px;display: block; margin: 0 auto;text-align: center;" ></p>
      </div>
```
The drawChart function calls the Google PieChart visualization function, which takes the data (the positive and negative counts) as input to create the pie chart. For more details about how the HTML code interacts with the Django views, refer to Chapter 6, Getting Started with Django, in the URL and views behind html web pages section. From the result page with the sentiment counts (see the Application usage overview section), the PageRank relevance of the scraped reviews can be calculated using one of the two links at the bottom of the page. The Django code behind this operation is discussed in the following section.
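One detail of the analyzer view is worth a side note: it assembles the Scrapy call as a single shell string and runs it with os.system, so a movie title containing quotes or shell metacharacters reaches the command line unescaped. One possible alternative, sketched here for Python 3 under the assumption that the spider accepts the same url_list and search_key arguments, is to build an argument list and hand it to subprocess.run with the working directory set explicitly. The helper names build_crawl_command and run_crawl are hypothetical.

```python
import subprocess


def build_crawl_command(urls, search_key, num_reviews):
    """Return the scrapy invocation as a list, so no shell quoting is needed."""
    url_list = ','.join(urls[:num_reviews])
    return [
        'scrapy', 'crawl', 'scrapy_spider_reviews',
        '-a', 'url_list=%s' % url_list,
        '-a', 'search_key=%s' % search_key,
    ]


def run_crawl(urls, search_key, num_reviews, spider_dir='../scrapy_spider'):
    """Run the crawler from spider_dir and report whether it exited cleanly."""
    cmd = build_crawl_command(urls, search_key, num_reviews)
    completed = subprocess.run(cmd, cwd=spider_dir)
    return completed.returncode == 0


# Only the command is built here; run_crawl would actually start Scrapy.
cmd = build_crawl_command(
    ['http://example.com/a', 'http://example.com/b', 'http://example.com/c'],
    'batman "vs" superman', 2)
```

Because os.system reports failure through its return value rather than by raising an exception, checking the return code as run_crawl does gives the view a way to notice a failed crawl and delete the SearchTerm in that case.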

# PageRank: Django view and the algorithm code
To rank the importance of the online reviews, we have implemented the PageRank algorithm (see Chapter 4, Web Mining Techniques, in the Ranking: PageRank algorithm section) in the application. The pgrank.py file in the pgrank folder within the webmining_server folder implements the algorithm as follows:
```python
from pages.models import Page,SearchTerm

num_iterations = 100000
eps=0.0001
D = 0.85

def pgrank(searchid):
    s = SearchTerm.objects.get(id=int(searchid))
    links = s.links.all()
    from_idxs = [i.from_id for i in links ]
    # Find the idxs that receive page rank 
    links_received = []
    to_idxs = []
    for l in links:
        from_id = l.from_id
        to_id = l.to_id
        if from_id not in from_idxs: continue
        if to_id  not in from_idxs: continue
        links_received.append([from_id,to_id])
        if to_id  not in to_idxs: to_idxs.append(to_id)
        
    pages = s.pages.all()
    prev_ranks = dict()
    for node in from_idxs:
        ptmp  = Page.objects.get(id=node)
        prev_ranks[node] = ptmp.old_rank
        
    conv=1.
    cnt=0
    # iterate until the ranks converge or the iteration limit is reached
    while conv>eps and cnt<num_iterations:
        next_ranks = dict()
        total = 0.0
        for (node,old_rank) in prev_ranks.items():
            total += old_rank
            next_ranks[node] = 0.0
        
        #find the outbound links and send the pagerank down to each of them
        for (node, old_rank) in prev_ranks.items():
            give_idxs = []
            for (from_id, to_id) in links_received:
                if from_id != node: continue
                if to_id  not in to_idxs: continue
                give_idxs.append(to_id)
            if (len(give_idxs) < 1): continue
            amount = D*old_rank/len(give_idxs)
            for id in give_idxs:
                next_ranks[id] += amount
        tot = 0
        for (node,next_rank) in next_ranks.items():
            tot += next_rank
        const = (1-D)/ len(next_ranks)
        
        for node in next_ranks:
            next_ranks[node] += const
        
        tot = 0
        for (node,old_rank) in next_ranks.items():
            tot += old_rank
        
        difftot = 0
        for (node, old_rank) in prev_ranks.items():
            new_rank = next_ranks[node]
            diff = abs(old_rank-new_rank)
            difftot += diff
        conv= difftot/len(prev_ranks)
        cnt+=1
        prev_ranks = next_ranks

    for (id,new_rank) in next_ranks.items():
        ptmp = Page.objects.get(id=id)
        url = ptmp.url
    
    for (id,new_rank) in next_ranks.items():
        ptmp = Page.objects.get(id=id)
        ptmp.old_rank = ptmp.new_rank
        ptmp.new_rank = new_rank
        ptmp.save()
```

This code takes all the links stored for the given SearchTerm object and computes the PageRank score for each page i at time t, where P(i) is given by the recursive equation:

<img src="./picture/B05143_08_05.jpg" width=300 />

Here, N is the total number of pages, and <img src="./picture/B05143_08_06.jpg" width=100 /> (Nj is the number of out links of page j) if page j points to i; otherwise, it is 0. The parameter D is the so-called damping factor (set to 0.85 in the preceding code), and it represents the probability of following the transition given by the transition matrix A. The equation is iterated until the convergence threshold eps is satisfied or the maximum number of iterations, num_iterations, is reached. The algorithm is called by clicking either the scrape and calculate page rank (may take a long time) link or the calculate page rank link at the bottom of the home.html page after the sentiment of the movie reviews has been displayed. Both links point to the pgrank_view function in views.py (through the URL declared in urls.py: url(r'^pg-rank/(?P<pk>\d+)/','webmining_server.views.pgrank_view', name='pgrank_view')):
```python
def pgrank_view(request,pk): 
    context = {}
    get_data = request.GET
    scrape = get_data.get('scrape','False')
    s = SearchTerm.objects.get(id=pk)
    
    if scrape == 'True':
        pages = s.pages.all().filter(review=True)
        urls = []
        for u in pages:
            urls.append(u.url)
        #crawl
        cmd = 'cd ../scrapy_spider & scrapy crawl scrapy_spider_recursive -a url_list=%s -a search_id=%s' %('\"'+str(','.join(urls[:]).encode('utf-8'))+'\"','\"'+str(pk)+'\"')
        os.system(cmd)

    links = s.links.all()
    if len(links)==0:
       context['no_links'] = True
       return render_to_response(
           'movie_reviews/pg-rank.html', RequestContext(request, context))
    #calc pgranks
    pgrank(pk)
    #load pgranks in descending order of pagerank
    pages_ordered = s.pages.all().filter(review=True).order_by('-new_rank')
    context['pages'] = pages_ordered
    
    return render_to_response(
        'movie_reviews/pg-rank.html', RequestContext(request, context)) 
```
This code calls the crawler to collect all the pages linked to the reviews and calculates the PageRank scores using the code discussed earlier. The scores are then displayed on the pg-rank.html page (in descending order of PageRank score), as shown in the Application usage overview section of this chapter. Since this function can take a long time to run (it may crawl thousands of pages), the run_scrapelinks.py command has been written to run the Scrapy crawler (the reader is invited to read or modify the script as an exercise).
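To see the PageRank iteration in isolation from the Django models, the following Python 3 sketch applies the same update to a small hand-made graph: each page shares D times its rank equally among the pages it links to, and then every page receives (1-D)/N. The graph, the function name, the uniform starting ranks, and the lower iteration cap are assumptions for the example; as in pgrank.py, a page without outgoing links simply passes nothing on.

```python
def pagerank(out_links, damping=0.85, eps=1e-4, max_iterations=1000):
    """Iterate the PageRank update until the mean change drops below eps."""
    nodes = sorted(out_links)
    ranks = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(max_iterations):
        # Every page starts the round with the constant (1 - D) / N term.
        next_ranks = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node, targets in out_links.items():
            if not targets:
                continue  # dangling page: its rank is not redistributed
            share = damping * ranks[node] / len(targets)
            for target in targets:
                next_ranks[target] += share
        change = sum(abs(next_ranks[n] - ranks[n]) for n in nodes) / len(nodes)
        ranks = next_ranks
        if change <= eps:
            break  # converged before reaching the iteration cap
    return ranks


# Hypothetical link graph: page -> pages it links to.
graph = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
ranks = pagerank(graph)
ordered = sorted(ranks, key=ranks.get, reverse=True)
```

The loop stops as soon as either condition is met (convergence or the iteration cap), which is the behaviour described for eps and num_iterations above.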

# Admin and API
In the last part of the chapter, we briefly describe how the models can be managed through the admin interface and how to implement an API endpoint to retrieve the data processed by the application. In the pages folder, we can set two admin interfaces in the admin.py file to check the data collected by the SearchTerm and Page models:
```python
from django.contrib import admin
from django_markdown.admin import MarkdownField, AdminMarkdownWidget
from pages.models import SearchTerm,Page,Link

class SearchTermAdmin(admin.ModelAdmin):
    formfield_overrides = {MarkdownField: {'widget': AdminMarkdownWidget}}
    list_display = ['id', 'term', 'num_reviews']
    ordering = ['-id']
    
class PageAdmin(admin.ModelAdmin):
    formfield_overrides = {MarkdownField: {'widget': AdminMarkdownWidget}}
    list_display = ['id', 'searchterm', 'url','title','content']
    ordering = ['-id','-new_rank']
    
admin.site.register(SearchTerm,SearchTermAdmin)
admin.site.register(Page,PageAdmin)
admin.site.register(Link)
```
Note that both SearchTermAdmin and PageAdmin display objects with decreasing ID (and new_rank in the case of PageAdmin). The following screenshot is an example:

Admin and API
Note that although it is not necessary, the Link model has also been included in the admin interface (admin.site.register(Link)). More interestingly, we can set up an API endpoint to retrieve the sentiment counts related to a movie's title. In the api.py file inside the pages folder, we can have the following:
```python
from rest_framework import views,generics
from rest_framework.permissions import AllowAny
from rest_framework.response import Response
from rest_framework.pagination import PageNumberPagination
from pages.serializers import SearchTermSerializer
from pages.models import SearchTerm,Page

class LargeResultsSetPagination(PageNumberPagination):
    page_size = 1000
    page_size_query_param = 'page_size'
    max_page_size = 10000
  
class SearchTermsList(generics.ListAPIView):

    serializer_class = SearchTermSerializer
    permission_classes = (AllowAny,)
    pagination_class = LargeResultsSetPagination
    
    def get_queryset(self):
        return SearchTerm.objects.all()  
        
class PageCounts(views.APIView):

    permission_classes = (AllowAny,)
    def get(self,*args, **kwargs):
        searchid=self.kwargs['pk']
        reviewpages = Page.objects.filter(searchterm=searchid).filter(review=True)
        npos = len([p for p in reviewpages if p.sentiment==1])
        nneg = len(reviewpages)-npos
        return Response({'npos':npos,'nneg':nneg})
```
The PageCounts class takes the ID of the search (the movie's title) as input and returns the sentiments, that is, the positive and negative counts, of the movie's reviews. To get the ID of a SearchTerm from a movie's title, you can either look at the admin interface or use the other API endpoint, SearchTermsList, which simply returns the list of the movies' titles together with the associated IDs. The serializer is set in the serializers.py file:
```python
from pages.models import SearchTerm
from rest_framework import serializers
        
class SearchTermSerializer(serializers.HyperlinkedModelSerializer):
    class Meta:
        model = SearchTerm
        fields = ('id', 'term')
```
To call these endpoints, we can again use the swagger interface (see Chapter 6, Getting Started with Django) or the curl command in the terminal. For instance:
```python
curl -X GET localhost:8000/search-list/
{"count":7,"next":null,"previous":null,"results":[{"id":24,"term":"the martian"},{"id":27,"term":"steve jobs"},{"id":29,"term":"suffragette"},{"id":39,"term":"southpaw"},{"id":40,"term":"vacation"},{"id":67,"term":"the revenant"},{"id":68,"term":"batman vs superman dawn of justice"}]}
```
and
```python
curl -X GET localhost:8000/pages-sentiment/68/
{"nneg":3,"npos":15}
```
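The JSON returned by the two endpoints is easy to consume from a script. The sketch below (Python 3, standard library only) parses the sample responses shown above, looks up the ID of a title, and computes the share of positive reviews. The helper names are hypothetical, and in a real client the strings would come from HTTP requests to the running server rather than from literals.

```python
import json

# Sample responses copied from the curl calls above.
SEARCH_LIST_RESPONSE = (
    '{"count":7,"next":null,"previous":null,"results":['
    '{"id":24,"term":"the martian"},{"id":27,"term":"steve jobs"},'
    '{"id":29,"term":"suffragette"},{"id":39,"term":"southpaw"},'
    '{"id":40,"term":"vacation"},{"id":67,"term":"the revenant"},'
    '{"id":68,"term":"batman vs superman dawn of justice"}]}'
)
SENTIMENT_RESPONSE = '{"nneg":3,"npos":15}'


def find_search_id(payload, title):
    """Return the ID stored for title, or None if it was never searched."""
    wanted = title.strip().lower()  # terms are stored stripped and lowercased
    for row in json.loads(payload)['results']:
        if row['term'] == wanted:
            return row['id']
    return None


def positive_share(payload):
    """Return the fraction of reviews classified as positive."""
    counts = json.loads(payload)
    total = counts['npos'] + counts['nneg']
    return counts['npos'] / total if total else 0.0


search_id = find_search_id(SEARCH_LIST_RESPONSE,
                           'Batman vs Superman Dawn of Justice')
share = positive_share(SENTIMENT_RESPONSE)
```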

# Summary
In this chapter, we described a movie review sentiment analyzer web application to make you familiar with some of the algorithms and libraries we discussed in Chapter 3, Supervised Machine Learning, Chapter 4, Web Mining Techniques, and Chapter 6, Getting Started with Django.

This is the end of a journey: by reading this book and experimenting with the code provided, you should have acquired significant practical knowledge about the most important machine learning algorithms used in the commercial environment nowadays.

You should now be ready to develop your own web applications and ideas using Python and the machine learning algorithms learned by reading this book. Many challenging data-related problems are present in the real world today, waiting to be solved by people who can grasp and apply the material treated in this book, and you, who have arrived at this point, are certainly one of those people.