# Data analysis of book reviews from Amazon

This notebook works with a set of book reviews from Amazon. The first cell downloads the data and reads it into a Pandas data frame.

Your job is to do some basic analysis of the data.

In [None]:
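import pandas as pd

# The file is tab-separated; names= supplies the two column labels.
reviews = pd.read_csv("https://pythondata.blob.core.windows.net/data/Book%20Reviews%20from%20Amazon.tsv", sep='\t', names=['Score', 'Description'])
# Split the reviews in half: the first half for training, the second for testing.
train, test = reviews[:len(reviews)//2], reviews[len(reviews)//2:]
print('There are', len(reviews), 'reviews total')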

To get you started, here are some of the rows of the data. Pandas makes it easy to look at rows using Python's slicing syntax:

In [None]:
reviews[:10]

Pandas also makes it easy to select a single column and operate on it. For example, the cell below takes the Description column for the first 20 rows and extracts the set of unique words:

In [None]:
set('\n'.join(reviews['Description'][:20]).split())

Now, can you use Pandas to find the highest, lowest, and mean scores?
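
As a hint, here is a minimal sketch on a small made-up frame. The name `sample` and its rows are assumptions for illustration only; in the notebook you would call the same methods on `reviews['Score']`.

```python
import pandas as pd

# Hypothetical stand-in for the reviews frame (same column names, made-up rows).
sample = pd.DataFrame({
    'Score': [5, 1, 4, 5, 2],
    'Description': ['Great read', 'Horrible plot', 'Good pacing',
                    'Fantastic', 'Bad ending'],
})

# Selecting one column gives a Series, which has summary methods built in.
highest = sample['Score'].max()
lowest = sample['Score'].min()
mean = sample['Score'].mean()
print(highest, lowest, mean)
```

Calling `sample['Score'].describe()` would report these values together with a few other summary statistics.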

Next, can you write a Pandas expression that selects all of the highest-scoring reviews?
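
One possible approach is boolean indexing: compare the column against its maximum and use the result to filter rows. The sketch below again uses a made-up frame named `sample` (an assumption, not the real data).

```python
import pandas as pd

# Hypothetical stand-in for the reviews frame (same column names, made-up rows).
sample = pd.DataFrame({
    'Score': [5, 1, 4, 5, 2],
    'Description': ['Great read', 'Horrible plot', 'Good pacing',
                    'Fantastic', 'Bad ending'],
})

# The comparison yields a boolean Series; indexing with it keeps the True rows.
is_top = sample['Score'] == sample['Score'].max()
top_reviews = sample[is_top]
print(top_reviews)
```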

Below we've written a function that returns a score for a given review. Unfortunately, we didn't do a very good job on it, and it just returns a random value. Can you make it better? You can go wild here with whatever solution you'd like!

One simple possibility is to look for known words such as "great", "fantastic", "bad", and "horrible".
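
A rough sketch of the keyword idea follows. The function name `keyword_score`, the two word sets, and the mapping onto a 1 to 5 scale (chosen to match the range of the random placeholder) are all assumptions for illustration, not a tuned solution.

```python
import re

# Illustrative word lists; extend them with whatever you find in the data.
POSITIVE = {'great', 'fantastic'}
NEGATIVE = {'bad', 'horrible'}

def keyword_score(description):
    # Lowercase the text and pull out the words, ignoring punctuation.
    words = re.findall(r"[a-z']+", description.lower())
    # Net sentiment: positive keyword hits minus negative keyword hits.
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    # Start from a neutral 3, shift by the balance, and clamp to the 1..5 range.
    return max(1, min(5, 3 + balance))

print(keyword_score('A great, fantastic book'))
print(keyword_score('It was fine'))
```

To try it in the evaluation cell, you could have score simply return `keyword_score(description)`.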

If you're feeling more ambitious, try using one of Python's great libraries for this. scikit-learn includes feature extractors such as TfidfVectorizer and CountVectorizer, and many different classifiers such as LinearSVC.
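
As a sketch of the scikit-learn route, the block below chains TfidfVectorizer and LinearSVC in a pipeline and fits it on a tiny made-up training frame. The names `sample_train` and `model_score` are assumptions for illustration; in the notebook you would fit on the Description and Score columns of train instead, and the default parameters used here are unlikely to be the best choice.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the train frame (made-up rows, same column names).
sample_train = pd.DataFrame({
    'Score': [5, 5, 1, 1],
    'Description': [
        'A great and fantastic story',
        'Great characters, fantastic ending',
        'A bad and horrible story',
        'Horrible characters, bad ending',
    ],
})

# The pipeline turns text into TF-IDF features, then fits a linear classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(sample_train['Description'], sample_train['Score'])

def model_score(description):
    # predict expects a sequence of documents and returns one label per document.
    return int(model.predict([description])[0])

print(model_score('great fantastic'))
```

Because the classifier only ever predicts labels it saw during training, the quality of the result depends heavily on how well the training half covers each score.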

In [None]:
import random

def score(description):
    return random.randint(1, 5)


Finally, let's see how you did! We'll run your function against the test half of the book reviews.

In [None]:
right = wrong = 0
# Each row of test.values holds the actual score followed by the description.
for actual, description in test.values:
    if score(description) == actual:
        right += 1
    else:
        wrong += 1

print(f'Your algorithm scored {right} right and {wrong} wrong ({right*100//len(test)}% correct)')