# Milestone 1
### Rob Lisy

# Instructions
Read the books.json data into Python.

This data is semi-structured, but for machine learning, we need to turn this data into tabular format. To do so, we need to loop through this data and extract the pieces of information that we need from each book entry. Loop through each book in the data and extract the following information:
- num_authors: number of authors (we can extract this from the list of authors)
- isbn: this can be directly extracted
- pageCount: this can be directly extracted
- title: this can be directly extracted
- desc_len: the number of words in the long description (we can extract this from the longDescription entry for each book; use 0 if there is no longDescription entry)
- has_word_data: whether the word "data" appears in the longDescription entry of each book. This is a True / False column, also called a binary column or a flag column.

Import the pandas library as pd and use pd.DataFrame to create a tabular data structure whose columns match the list above and whose content is the information you extracted. Show the first 20 rows to make sure the data looks alright. You can do that using the .head(20) method.
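
The loop described above can be sketched as follows. This is only an illustration, assuming books.json parses to a list of book dictionaries; the two sample records are made up, and `extract_row` is a hypothetical helper name, not part of the assignment.

```python
import pandas as pd

# Two hypothetical records standing in for the parsed contents of books.json.
sample_books = [
    {
        "title": "Example Book A",
        "isbn": "0000000001",
        "pageCount": 100,
        "authors": ["Author One", "Author Two"],
        "longDescription": "A short book about data and how data is stored.",
    },
    {
        "title": "Example Book B",
        "isbn": "0000000002",
        "pageCount": 0,
        "authors": ["Author Three"],
        # no longDescription entry for this book
    },
]


def extract_row(book):
    """Pull the six requested fields out of one book entry."""
    # Fall back to an empty string when longDescription is missing.
    long_desc = book.get("longDescription") or ""
    return {
        "num_authors": len(book.get("authors", [])),
        "isbn": book.get("isbn"),
        "pageCount": book.get("pageCount"),
        "title": book.get("title"),
        # Whitespace split: an empty description yields 0 words.
        "desc_len": len(long_desc.split()),
        # Case-insensitive substring test (also matches "database").
        "has_word_data": "data" in long_desc.lower(),
    }


rows = [extract_row(book) for book in sample_books]
books_from_loop = pd.DataFrame(rows)
print(books_from_loop.head(20))
```

Building a list of dictionaries first and calling `pd.DataFrame` once is usually simpler than appending rows to a DataFrame inside the loop.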

### Import your libraries and data...

In [129]:
import pandas as pd
import json
# regex, baby.
import re

# pd.read_json already returns a DataFrame, so no extra wrapper is needed
raw_books = pd.read_json('./data/books.json')
print(raw_books.head(2))

  _id                              title        isbn  pageCount  \
0   1                  Unlocking Android  1933988673        416   
1   2  Android in Action, Second Edition  1935182722        592   

                               publishedDate  \
0  {'$date': '2009-04-01T00:00:00.000-0700'}   
1  {'$date': '2011-01-14T00:00:00.000-0800'}   

                                        thumbnailUrl  \
0  https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....   
1  https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....   

                                    shortDescription  \
0  Unlocking Android: A Developer's Guide provide...   
1  Android in Action, Second Edition is a compreh...   

                                     longDescription   status  \
0  Android is an open source mobile phone platfor...  PUBLISH   
1  When it comes to mobile apps, Android can do a...  PUBLISH   

                                         authors             categories  
0  [W. Frank Ableson, Charlie Collins, Robi S

#### View the data types

In [114]:
raw_books.dtypes

_id                 object
title               object
isbn                object
pageCount            int64
publishedDate       object
thumbnailUrl        object
shortDescription    object
longDescription     object
status              object
authors             object
categories          object
dtype: object
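
Most columns report `object`, which hides both nested values (the author lists) and missing entries. A quick check like the one below can surface these before cleaning. It is a sketch on a made-up three-row frame, not on the real `raw_books`; the name `toy` and its contents are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame imitating the shape of raw_books.
toy = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "authors": [["X", "Y"], ["Z"], []],
    "longDescription": ["Some text.", np.nan, "More text."],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Python types actually stored in an object column.
types_in_authors = toy["authors"].map(type).value_counts()
print(types_in_authors)
```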

### Clean the data

### We need to make the following columns:
- num_authors: number of authors (we can extract this from the list of authors)
- isbn: this can be directly extracted
- pageCount: this can be directly extracted
- title: this can be directly extracted
- desc_len: the number of words in the long description (we can extract this from the longDescription entry for each book; use 0 if there is no longDescription entry)
- has_word_data: whether the word "data" appears in the longDescription entry of each book. This is a True / False column, also called a binary column or a flag column.

In [133]:
# Get the number of authors in the author list.
raw_books['num_authors'] = raw_books['authors'].str.len()
# Words are usually separated by spaces, so counting spaces approximates the word count
# (for single-spaced text this is one less than the true number of words).
raw_books['desc_len'] = raw_books['longDescription'].str.count(' ')
# zeros for missing long descriptions
raw_books['desc_len'] = raw_books['desc_len'].fillna(0)

# Count case-insensitive occurrences of the substring "data" (this also matches words such as "database")
raw_books['word_data_count'] = raw_books['longDescription'].str.count('data', flags=re.I)
# Missing long descriptions produce NaN; treat them as zero occurrences
raw_books['word_data_count'] = raw_books['word_data_count'].fillna(0)
# Convert the count to a boolean flag
raw_books['has_word_data'] = raw_books['word_data_count'] > 0

# Get rid of the unneeded columns, via a copy to a new (skinnier) dataframe
drop_cols = ['_id', 'publishedDate', 'thumbnailUrl', 
             'shortDescription', 'longDescription',
             'status', 'authors', 'categories',
             'word_data_count']

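# Pass the labels by keyword; the positional axis argument is not accepted by recent pandas versions
books = raw_books.drop(columns=drop_cols)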

books.head(20)

|   | title | isbn | pageCount | num_authors | desc_len | word_data_count | has_word_data |
|---|-------|------|-----------|-------------|----------|-----------------|---------------|
| 0 | Unlocking Android | 1933988673 | 416 | 3 | 358.0 | 3.0 | True |
| 1 | Android in Action, Second Edition | 1935182722 | 592 | 2 | 106.0 | 0.0 | False |
| 2 | Specification by Example | 1617290084 | 0 | 1 | 0.0 | 0.0 | False |
| 3 | Flex 3 in Action | 1933988746 | 576 | 2 | 276.0 | 2.0 | True |
| 4 | Flex 4 in Action | 1935182420 | 600 | 4 | 350.0 | 3.0 | True |
| 5 | Collective Intelligence in Action | 1933988312 | 425 | 1 | 260.0 | 3.0 | True |
| 6 | Zend Framework in Action | 1933988320 | 432 | 3 | 300.0 | 1.0 | True |
| 7 | Flex on Java | 1933988797 | 265 | 2 | 299.0 | 1.0 | True |
| 8 | Griffon in Action | 1935182234 | 375 | 4 | 245.0 | 0.0 | False |
| 9 | OSGi in Depth | 193518217X | 325 | 1 | 280.0 | 0.0 | False |
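
The cell above counts spaces and matches the substring "data", so "database" or "metadata" also set the flag. A stricter variant, sketched here on made-up descriptions rather than the real data, splits on whitespace for the word count and uses a word-boundary regex for the flag. The frame `demo` and its contents are assumptions for illustration.

```python
import pandas as pd

# Hypothetical descriptions; None marks a book with no longDescription.
demo = pd.DataFrame({
    "longDescription": [
        "Data is everywhere.",
        "A database stores metadata.",
        None,
    ]
})

# Splitting on whitespace counts words directly; missing entries become 0.
demo["desc_len"] = (
    demo["longDescription"].str.split().str.len().fillna(0).astype(int)
)

# \b marks a word boundary, so "database" and "metadata" do not match.
demo["has_word_data"] = demo["longDescription"].str.contains(
    r"\bdata\b", case=False, regex=True, na=False
)
print(demo)
```

Whether the flag should fire on "database" depends on how the assignment's "the word data" is read, so the substring version may still be the intended one.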
