> This notebook truncates the original data for the class. The data is:

1. Read in from an Excel file
2. Saved as multiple CSV files, one per boro, in ``../_data``
3. Described, along with the cleaning steps, in the first blog post

In [2]:
import pandas

> Pandas can read Excel files natively, but Excel is slow to read and write (see the timing below).  We will use ``CSV`` files instead.
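
> A minimal sketch of the convert-once idea: parse the Excel file a single time, save a CSV copy, and read the CSV on later runs. The helper name ``load_cached`` and the file names in the demo are hypothetical and not part of the original notebook.

```python
import os
import tempfile

import pandas


def load_cached(xlsx_path, csv_path):
    """Read csv_path if it exists; otherwise parse xlsx_path once and cache it as CSV."""
    if os.path.exists(csv_path):
        return pandas.read_csv(csv_path)
    df = pandas.read_excel(xlsx_path)  # slow path, taken only on the first run
    df.to_csv(csv_path, index=False)
    return df


# Demo in a throwaway directory: the cache already exists, so the
# (nonexistent) Excel file is never opened.
tmp = tempfile.mkdtemp()
cache = os.path.join(tmp, 'cache.csv')
pandas.DataFrame({'CAMIS': [1, 2], 'BORO': ['BRONX', 'QUEENS']}).to_csv(cache, index=False)
cached = load_cached(os.path.join(tmp, 'not_there.xlsx'), cache)
```

> One caveat with this approach: a CSV round trip does not preserve column types, so date columns come back as plain strings unless they are parsed again on read.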

In [4]:
%%time
df = pandas.read_excel( '../assets/DOHMH_New_York_City_Restaurant_Inspection_Results.xlsx' )

CPU times: user 1min 28s, sys: 956 ms, total: 1min 29s
Wall time: 1min 30s


In [5]:
boros = df[df.BORO != 'Missing']  # drop rows with no boro identified
num_reviews = boros.groupby('CAMIS').size()  # number of rows per restaurant id (CAMIS)

In [10]:
boros = boros.reset_index()

> Keep only restaurants with more than 45 reviews (rows in the table).

In [13]:
top_ids = num_reviews[num_reviews > 45].index  # restaurant ids with more than 45 rows
top_reviewed = boros[boros.CAMIS.isin(top_ids)].copy()  # vectorized membership test
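
> The count-then-filter step can be checked on a toy table. This is only an illustration: the frame ``toy`` and the threshold of 2 are made up, while the notebook uses the real table and a threshold of 45.

```python
import pandas

# Toy stand-in for the inspection table: one row per record.
toy = pandas.DataFrame({
    'CAMIS': [1, 1, 1, 2, 2, 3],
    'BORO': ['BRONX', 'BRONX', 'BRONX', 'QUEENS', 'QUEENS', 'BRONX'],
})

counts = toy.groupby('CAMIS').size()   # rows per restaurant id
keep_ids = counts[counts > 2].index    # ids strictly above the threshold
kept = toy[toy.CAMIS.isin(keep_ids)]   # only restaurant 1 has more than 2 rows
```

> Note that the comparison is strict, so a restaurant with exactly 45 rows would be dropped by the notebook's filter.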

In [14]:
for boro in boros.BORO.unique():
    top_reviewed[top_reviewed.BORO == boro].to_csv( '../_data/{boro}.csv'.format(boro=boro), index=False)
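
> An alternative way to split a table into one file per boro is to iterate over ``groupby``. The sketch below uses a made-up frame ``toy`` and a temporary directory ``out_dir`` in place of ``../_data``; both names are hypothetical.

```python
import os
import tempfile

import pandas

toy = pandas.DataFrame({
    'CAMIS': [1, 2, 3],
    'BORO': ['BRONX', 'QUEENS', 'BRONX'],
})

out_dir = tempfile.mkdtemp()  # stand-in for ../_data
for boro, rows in toy.groupby('BORO'):
    # groupby yields (key, sub-frame) pairs, one pair per boro present in the data
    rows.to_csv(os.path.join(out_dir, '{boro}.csv'.format(boro=boro)), index=False)

written = sorted(os.listdir(out_dir))
bronx = pandas.read_csv(os.path.join(out_dir, 'BRONX.csv'))
```

> The two approaches differ in one edge case. The loop over ``boros.BORO.unique()`` writes a header-only file for a boro with no rows left after filtering, whereas the ``groupby`` version writes no file at all for it. The blog template later reads one CSV per boro, so the original loop is the safer choice there.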

------
## stopped parsing data
## creating a blog post

In [40]:
import IPython, jinja2, numpy, bokeh.plotting, bokeh.charts, markdown, yaml, datetime
%matplotlib inline
bokeh.plotting.output_notebook(resources=bokeh.resources.CDN)

In [57]:
post = jinja2.Template("""
> All of the analysis for this post was created in [``/notebook/Excel_to_CSV.ipynb``](https://github.com/tonyfast/insight/blob/gh-pages/notebook/Excel_to_CSV.ipynb)

The raw Excel dataset contains __{{df.shape[0]}}__ rows.  __{{(df.BORO=='Missing').sum()}}__ were removed
because they do not have a boro identified. Restaurants with 45 or fewer samples were also removed; this was done only to create
a more manageable dataset for the class.  A total of __{{top_reviewed.shape[0]}}__ rows are kept for the class; this is
{{(top_reviewed.shape[0]/df.shape[0]*100)|round(1)}}% of the raw data.

* The dataset has {{df.shape[1]}} columns.  The fields are:
{% for c in df.columns %}
    * __{{c}}__
{% endfor %}

A sample of the table:

{{boros.iloc[::15000].iloc[:5].to_html()}}

# All of the Boros are accounted for.

{% for boro in boros.BORO.unique().tolist() %}
* [{{boro}}](https://github.com/tonyfast/insight/blob/gh-pages/_data/{{boro}}.csv) - {{pandas.read_csv('../_data/{boro}.csv'.format(boro=boro)).shape[0]}} rows
{% endfor %}


""").render(**globals())

post = """---
layout: post
title: Parsing the Raw Data
---
""" + post

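# `date` is not defined in the cells shown here; today's date is assumed
date = datetime.date.today()
fn = date.strftime('%Y-%m-%d') + '-' + next(yaml.safe_load_all(post))['title'].replace(' ','-') + '.markdown'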
with open('../_posts/'+fn,'w') as f:
    f.write(post)
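
> The file name built above follows the ``YYYY-MM-DD-title.markdown`` pattern that Jekyll expects for files in ``_posts``; the front matter and directory layout suggest this is a Jekyll site. The sketch below isolates that naming step so it can be checked by hand. The helper ``post_filename`` is hypothetical and the date is arbitrary.

```python
import datetime


def post_filename(title, day):
    """Build a post file name of the form YYYY-MM-DD-Title-With-Dashes.markdown."""
    return day.strftime('%Y-%m-%d') + '-' + title.replace(' ', '-') + '.markdown'


# Arbitrary example date, chosen only for illustration.
example = post_filename('Parsing the Raw Data', datetime.date(2015, 1, 2))
# example == '2015-01-02-Parsing-the-Raw-Data.markdown'
```

> Only spaces are replaced, so a title containing characters such as ``/`` or ``:`` would need extra cleaning before it is safe to use as a file name.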