# ML bootcamp hackathon challenge: sentiment analysis

At this stage you have learned what machine learning is and you have seen examples of how to use machine learning. Now it is time to see how well YOU can generalize what you have learned to a new challenge.

The hackathon has two tracks. In the **Ideation Track**, you brainstorm which great apps we could build for the enterprise with this new technology. What are the 'jobs to be done' that machine learning can solve for us in the future? Create an intrapreneurship-style pitch and present it to the team on Friday.
https://jam4.sapjam.com/groups/r7ILMAl5MxS8rgHSaNR8L9/overview_page/46399

In the **Data Science Track**, we will run a hackathon challenge on a popular machine learning task: predicting consumer sentiment from online reviews. You are given a set of movie reviews and their labels (positive, negative), and you have to build a system that can predict the sentiment of a new movie review.

We are using the popular polarity data set from Cornell University.
http://www.cs.cornell.edu/people/pabo/movie-review-data/
 

In [None]:
## download the movie review data set
import os
import urllib.request

# set http proxy env variable
os.environ['http_proxy'] = 'proxy.sin.sap.corp:8080'

# download file to /tmp
url = "http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz"
file_name = '/tmp/review_polarity.tar.gz'
if not os.path.isfile(file_name):
    # the with statement closes both the response and the output file
    with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
        data = response.read() # a `bytes` object
        out_file.write(data)

# make file world readable (a regular file does not need the execute bit)
os.chmod(file_name, 0o644)

In [None]:
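# extract file to /tmp dir
import tarfile
import os.path

# path of the archive downloaded in the previous cell
archive_path = '/tmp/review_polarity.tar.gz'

if not os.path.isdir('/tmp/txt_sentoken'):
    # extract into /tmp without changing the working directory
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path='/tmp')
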
# make directories and files world readable
for label in ['pos', 'neg']:
    label_dir = '/tmp/txt_sentoken/%s/' % label
    # directories need the execute bit so that others can list them
    os.chmod(label_dir, 0o755)
    for review_file in os.listdir(label_dir):
        os.chmod(os.path.join(label_dir, review_file), 0o644)


In [None]:
# load positive and negative movie review text

# list to store the examples
reviews = []

# load positive reviews
for file_name in os.listdir('/tmp/txt_sentoken/pos/'):
    with open(os.path.join('/tmp/txt_sentoken/pos/', file_name)) as fin:
        # append sentiment 'pos' label and review text
        reviews.append(('pos', fin.read()))
        
# load negative reviews
for file_name in os.listdir('/tmp/txt_sentoken/neg/'):
    with open(os.path.join('/tmp/txt_sentoken/neg/', file_name)) as fin:
        # append sentiment 'neg' label and review text
        reviews.append(('neg', fin.read()))

# the data set should now contain 2000 reviews (1000 positive, 1000 negative)
print(len(reviews))

In [None]:
# randomly split data into 80% training and 20% testing
# each example is a tuple consisting of a (pos/neg)  sentiment label 
# and the text of the review
import random
random.seed(42)
random.shuffle(reviews)

size_train = int(len(reviews)*0.8)
reviews_train = reviews[:size_train]
reviews_test = reviews[size_train:]

print("size of training set:", len(reviews_train))
print("size of test set:", len(reviews_test))

# split labels and text
labels_train = [item[0] for item in reviews_train]
corpus_train = [item[1] for item in reviews_train]

labels_test = [item[0] for item in reviews_test]
corpus_test = [item[1] for item in reviews_test]

In [None]:
# let's look at a few examples of the training set
import pprint
pp = pprint.PrettyPrinter(indent=3)
pp.pprint(reviews_train[:3])

## Level 1: data rookie
Explore the data and find out which words are good features for positive and negative reviews. Visualize the data. Remember what you have learned in the text mining notebook about pre-processing for text. 
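
As one possible starting point, the sketch below counts in how many reviews of each class a word occurs and ranks words by a smoothed ratio between the two classes. The helper names (`tokenize`, `class_word_counts`, `most_distinctive`) and the toy reviews are hypothetical and only illustrate the idea; the simple regular-expression tokenizer is an assumption, not the pre-processing from the text mining notebook.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def class_word_counts(reviews):
    """Count, per class, the number of reviews in which each word occurs."""
    counts = {'pos': Counter(), 'neg': Counter()}
    for label, text in reviews:
        # set() makes each word count at most once per review
        counts[label].update(set(tokenize(text)))
    return counts

def most_distinctive(counts, label, other, top_n=10, min_count=1):
    """Rank words of `label` by the add-one smoothed ratio against `other`."""
    ratio = {
        word: (n + 1) / (counts[other][word] + 1)
        for word, n in counts[label].items()
        if n >= min_count
    }
    return sorted(ratio, key=ratio.get, reverse=True)[:top_n]

# hypothetical toy data in the same (label, text) format as `reviews`
toy_reviews = [
    ('pos', 'a brilliant , moving film'),
    ('pos', 'brilliant acting and a moving story'),
    ('neg', 'a dull , boring film'),
    ('neg', 'boring plot and dull acting'),
]
toy_counts = class_word_counts(toy_reviews)
print('pos:', most_distinctive(toy_counts, 'pos', 'neg', top_n=2))
print('neg:', most_distinctive(toy_counts, 'neg', 'pos', top_n=2))
```

On the real data you would call `class_word_counts(reviews_train)` and raise `min_count` so that rare words do not dominate the ranking.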

In [None]:
## pre-process, visualize and explore data

#  YOUR CODE GOES HERE

## Level 2: data sophomore
Build a first classifier that can predict if a given movie review is positive or negative. 
Remember what you have learned in the data science and text mining notebooks about feature extraction pipelines, training and evaluation.
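
A minimal baseline could look like the sketch below. It assumes scikit-learn is installed and pairs bag-of-words counts with a multinomial Naive Bayes classifier; both choices are assumptions for illustration, not the prescribed solution. The function name `train_and_evaluate` and the toy data are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_and_evaluate(corpus_train, labels_train, corpus_test, labels_test):
    """Fit a bag-of-words Naive Bayes pipeline and return (model, accuracy)."""
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(corpus_train, labels_train)
    predictions = model.predict(corpus_test)
    return model, accuracy_score(labels_test, predictions)

# hypothetical toy data; in the notebook, pass corpus_train, labels_train,
# corpus_test and labels_test from the split above instead
toy_train = ['brilliant moving film', 'wonderful brilliant acting',
             'dull boring film', 'boring awful plot']
toy_train_labels = ['pos', 'pos', 'neg', 'neg']
toy_test = ['brilliant wonderful story', 'awful dull movie']
toy_test_labels = ['pos', 'neg']

toy_model, toy_accuracy = train_and_evaluate(
    toy_train, toy_train_labels, toy_test, toy_test_labels)
print('accuracy:', toy_accuracy)
```

Wrapping the vectorizer and the classifier in one pipeline means the vocabulary is learned from the training texts only, and the same transformation is applied when predicting on the test texts.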

In [None]:
## feature extraction, model training on the training set and evaluation on the test set using accuracy

#  YOUR CODE GOES HERE

## Level 3: data ninja
Improve your classifier: try out different pre-processing steps, features, classifier models, etc.
Let's see who can build the most accurate classifier!
Let the games begin.
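
One way to organize such experiments is to keep the candidate pipelines in a dictionary and score them in a loop, as in the sketch below. It assumes scikit-learn; the three candidates and the helper names `build_candidates` and `compare_pipelines` are hypothetical examples, not a recommendation of which models work best on this data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def build_candidates():
    """Return a few example pipelines keyed by a descriptive name."""
    return {
        'counts + naive bayes': make_pipeline(
            CountVectorizer(), MultinomialNB()),
        'unigrams and bigrams + naive bayes': make_pipeline(
            CountVectorizer(ngram_range=(1, 2)), MultinomialNB()),
        'tf-idf + logistic regression': make_pipeline(
            TfidfVectorizer(), LogisticRegression(max_iter=1000)),
    }

def compare_pipelines(candidates, corpus_train, labels_train,
                      corpus_test, labels_test):
    """Fit every candidate and return a dict mapping name to accuracy."""
    results = {}
    for name, model in candidates.items():
        model.fit(corpus_train, labels_train)
        # for classifiers, Pipeline.score reports mean accuracy
        results[name] = model.score(corpus_test, labels_test)
    return results

# hypothetical toy data; use the real train/test split in the notebook
toy_train = ['brilliant moving film', 'wonderful brilliant acting',
             'dull boring film', 'boring awful plot']
toy_train_labels = ['pos', 'pos', 'neg', 'neg']
toy_test = ['brilliant wonderful story', 'awful dull movie']
toy_test_labels = ['pos', 'neg']

toy_results = compare_pipelines(build_candidates(), toy_train,
                                toy_train_labels, toy_test, toy_test_labels)
for name, accuracy in sorted(toy_results.items(), key=lambda kv: -kv[1]):
    print('%-40s %.3f' % (name, accuracy))
```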

In [None]:
## build pipelines to test different pre-processing, feature extraction, and machine learning models
## try to get the best accuracy on the test set

#  YOUR CODE GOES HERE

## (Optional) Level 4: the architect
Build an end-to-end system with REST APIs for sentiment analysis. The input is a movie review, the output is the model prediction. You can stay within Python or use the software stack of your choice.
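
If you stay within Python, a service can be sketched with the standard library alone, as below. Everything here is an assumption for illustration: the `/predict` path, the `{"review": ...}` JSON format, port 8000, and in particular `predict_sentiment`, which is a hypothetical keyword-based placeholder that you would replace with the `predict` call of your trained classifier.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict_sentiment(text):
    """Placeholder model (hypothetical): swap in your trained classifier."""
    positive_words = {'good', 'great', 'brilliant'}
    negative_words = {'bad', 'boring', 'awful'}
    tokens = text.lower().split()
    score = (sum(t in positive_words for t in tokens)
             - sum(t in negative_words for t in tokens))
    return 'pos' if score >= 0 else 'neg'

def handle_predict(body):
    """Map a raw JSON request body to (HTTP status code, response dict)."""
    try:
        review = json.loads(body)['review']
    except (ValueError, KeyError, TypeError):
        return 400, {'error': 'expected a JSON body like {"review": "..."}'}
    if not isinstance(review, str):
        return 400, {'error': '"review" must be a string'}
    return 200, {'sentiment': predict_sentiment(review)}

class SentimentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != '/predict':
            self.send_error(404)
            return
        length = int(self.headers.get('Content-Length', 0))
        status, payload = handle_predict(self.rfile.read(length))
        data = json.dumps(payload).encode('utf-8')
        self.send_response(status)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(data)))
        self.end_headers()
        self.wfile.write(data)

def serve(port=8000):
    """Start the service; this call blocks until the process is interrupted."""
    HTTPServer(('', port), SentimentHandler).serve_forever()

# the request logic can be tried without starting a server
print(handle_predict(b'{"review": "a great film"}'))
```

Calling `serve()` starts the service; a client would then send a POST request to `/predict` with a JSON body containing the review text. Keeping the request logic in `handle_predict` separates it from the HTTP plumbing, so it can be tested without opening a socket.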

## (Optional) Level 5: the artist
Design a beautiful user experience that shows how the labels predicted by the algorithm can be presented to a consumer on a mobile device. Connect it to the REST web service or just mock up the inputs.