## **Getting your text data into Turi Create**
Whether you are cleaning your own data or using an open source training data set, Turi Create requires a somewhat specific format for building a text classifier. The data we are using here is from http://www.cs.jhu.edu/~mdredze/datasets/sentiment.

**TL;DR**: The easiest way to get a text classifier running with Turi Create is to have a .csv with two columns:
- **rating**: a numeric value indicating good or bad, or a rating on a 5-star scale
- **text**: free text in sentence form (this notebook names the column `review`)
  - *Note: many open source training data sets tokenize the words for you. This is not helpful when using Turi Create to build a model.*
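
As a rough sketch of the two-column layout described above, the snippet below writes such a file with Python's standard `csv` module. The rows are made up for illustration, and the in-memory buffer stands in for a real file on disk.

```python
import csv
import io

# Hypothetical example rows; real data would come from your own reviews
rows = [
    {"rating": 5, "text": "Great card, works as advertised."},
    {"rating": 2, "text": "Stopped working after a week, very disappointing."},
]

# Written to an in-memory buffer here; for a real file, use
# open("reviews.csv", "w", newline="") instead (the file name is an assumption)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["rating", "text"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The `csv` module quotes any text that contains commas, so free-form sentences stay in a single column.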

In [1]:
import pandas as pd
import turicreate as tc 

In [2]:
# some helper functions to clean up our review data
def find_between( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return ""

# clean up string ratings (e.g. "5.0") by converting them to int for the target column
def try_int(string):
    try:
        return int(string.split(".")[0])
    except (AttributeError, ValueError):
        # not a string, or no integer part before the decimal point
        return None

# this was created specifically for this dataset
def clean_review(rev):
    text_review = find_between(rev, "<review_text>\n", "</review_text>").replace("\n", " ")
    rating = find_between(rev, "<rating>\n", "\n</rating>")
    return {'rating' : try_int(rating), 'review' : text_review}
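
To make the helpers concrete, here is a small self-contained sketch that repeats them and runs `clean_review` on a made-up record. The sample string only imitates the tag layout the helpers look for; it is not taken from the dataset.

```python
def find_between(s, first, last):
    # Return the text between the two markers, or "" if either is missing
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        return ""

def try_int(string):
    try:
        return int(string.split(".")[0])
    except (AttributeError, ValueError):
        return None

def clean_review(rev):
    text_review = find_between(rev, "<review_text>\n", "</review_text>").replace("\n", " ")
    rating = find_between(rev, "<rating>\n", "\n</rating>")
    return {"rating": try_int(rating), "review": text_review}

# Made-up record (hypothetical, not from the dataset)
sample = (
    "<review>\n<rating>\n4.0\n</rating>\n"
    "<review_text>\nWorks fine.\nNo complaints.\n</review_text>\n"
)
cleaned = clean_review(sample)
print(cleaned)  # {'rating': 4, 'review': 'Works fine. No complaints. '}

# A record without a rating tag yields rating None instead of raising
no_rating = clean_review("<review_text>\nNo rating here\n</review_text>")
print(no_rating)  # {'rating': None, 'review': 'No rating here '}
```

Newlines inside the review text become spaces, which is why the cleaned text ends with a trailing space.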


## **Load the data**
We will build a model only on the electronics reviews in this data set. We load the separate positive and negative reviews.

In [3]:
## assuming you've downloaded the above linked dataset and extracted its contents
## load the positive and negative review data
with open("./sorted_data/electronics/positive.review", "r", encoding="ISO-8859-1") as infile:
    positive_lines = infile.read()
    positive_reviews = positive_lines.split("</review>")
with open("./sorted_data/electronics/negative.review", "r", encoding="ISO-8859-1") as infile:
    negative_lines = infile.read()
    negative_reviews = negative_lines.split("</review>")
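
Splitting on the closing tag has one side effect worth knowing about. The sketch below uses a made-up two-record string (an assumption about the layout, not real data) to show that text after the last `</review>` ends up as an extra chunk with no review in it.

```python
# Made-up stand-in for the contents of a .review file (hypothetical)
raw = (
    "<review>\n<rating>\n5.0\n</rating>\n</review>\n"
    "<review>\n<rating>\n1.0\n</rating>\n</review>\n"
)

chunks = raw.split("</review>")
print(len(chunks))       # 3: two records plus the leftover after the last tag
print(repr(chunks[-1]))  # '\n'
```

Chunks like this contain no rating, so they need to be filtered out before building a model.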

## **Clean the data**
The raw data is fairly messy and stored in an HTML-like format. The cell below uses the `clean_review` helper to parse the text between tags and pull out the data we want.

In [4]:
# clean up the reviews using a function we wrote above
cleaned_reviews = [clean_review(r) for r in positive_reviews + negative_reviews]
# drop rows whose rating could not be parsed, then make sure ratings are stored as ints
review_df = pd.DataFrame.from_records(cleaned_reviews).dropna()
review_df['rating'] = review_df['rating'].astype(int)
review_sframe = tc.SFrame(review_df)

In [5]:
review_sframe.head(3) # take a look at the data

rating,review
5,I received my Kingston 256MB SD card just as ...
4,"Works well, especially for anyone who still has ..."
4,Not as easy to use as a larger Panasonic I used ...


## **Build a model**
Below we build the model, passing the feature column (`review`) and the target column (`rating`) to `tc.text_classifier.create`.

In [6]:
# build the model
model = tc.text_classifier.create(review_sframe, features=['review'], target='rating')

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



In [7]:
# predict a bad review
model.predict(tc.SFrame({'review' : ['This product didnt meet expectations']}))

dtype: int
Rows: 1
[2]

In [8]:
# predict a good review
model.predict(tc.SFrame({'review' : ['This product far exceeded my expectations']}))

dtype: int
Rows: 1
[5]