# Raw data are **sacrosanct**

<blockquote class="twitter-tweet" data-lang="en-gb"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb?ref_src=twsrc%5Etfw">@tomjwebb</a> don&#39;t, not even with a barge pole, not for one second, touch or otherwise edit the raw data files. Do any manipulations in script</p>&mdash; Gavin Simpson (@ucfagls) <a href="https://twitter.com/ucfagls/status/556107371634634755?ref_src=twsrc%5Etfw">16 January 2015</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 


# Getting started with your data 

### Should I version control my raw data?
<div class='info'> The raw data *should* never change. All processing and data wrangling must be done on copies of your raw data. </div>

<br>
However, it is important to consider how you are going to share your data. The [best practices for sharing data on the Web summary](https://www.w3.org/TR/dwbp/#bp-summary) is a good place to get started.

You can also explore alternatives like [git annex](https://git-annex.branchable.com/) and [Datalad](http://datalad.org/) to version control your data, or [FigShare](https://figshare.com/) to share your data.
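Whichever tool you choose, one lightweight way to confirm that a raw file has not changed is to record a checksum next to it. The following is a minimal sketch using only the standard library; the helper name `sha256_of_file` and the example file are hypothetical and stand in for your own raw data.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file standing in for a raw data file.
with tempfile.TemporaryDirectory() as tmp:
    raw_file = Path(tmp) / "example.csv"  # hypothetical file name
    raw_file.write_text("country,points\nChile,87\n")

    # Record this value alongside the data (for example in the README).
    recorded = sha256_of_file(raw_file)
    unchanged = sha256_of_file(raw_file) == recorded

    # Simulate an accidental edit and check that it is detected.
    raw_file.write_text("country,points\nChile,88\n")
    changed = sha256_of_file(raw_file) != recorded

print(unchanged, changed)
```

Re-running the checksum before each analysis is a cheap way to notice an accidental edit early.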


# Adding raw data

We will be using the Kaggle [wine review](https://www.kaggle.com/residentmario/renaming-combining-data/data) dataset.
The first step is to download the data and store it in our `data/raw` directory.

If you followed the installation instructions, you should already have a copy of the datasets that we will be using.

Otherwise you can get a copy at [https://drive.google.com/drive/u/1/folders/1b2B0KWS0UAVQqFgzx2R2qMNeiiB98lMe?usp=sharing](https://drive.google.com/drive/u/1/folders/1b2B0KWS0UAVQqFgzx2R2qMNeiiB98lMe?usp=sharing)
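Once the file is downloaded, a script can place it in `data/raw` and remove write permission so that it is harder to edit by accident. This is a sketch under assumptions: the file name `downloaded.csv` is hypothetical, a temporary directory stands in for your project, and permission handling differs between operating systems.

```python
import shutil
import stat
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)  # stands in for the project root
    downloaded = project / "downloaded.csv"  # hypothetical download location
    downloaded.write_text("country,points\nItaly,87\n")

    # Create data/raw if it does not exist yet.
    raw_dir = project / "data" / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)

    # Copy the file, preserving timestamps.
    raw_copy = raw_dir / downloaded.name
    shutil.copy2(downloaded, raw_copy)

    # Drop write permission for everyone.
    raw_copy.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    mode = stat.S_IMODE(raw_copy.stat().st_mode)
    is_read_only = not (mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
    stored_path = raw_copy.relative_to(project).as_posix()

    # Restore write permission so the temporary directory can be cleaned up.
    raw_copy.chmod(stat.S_IRUSR | stat.S_IWUSR)

print(stored_path, is_read_only)
```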

# You got data... is it enough? Data without documentation has no value
### metadata = data about data 

Metadata is information that describes, explains, locates, or makes it easier
to <strong>find, access, and use</strong> a resource.

<img src="assets/meta.jpg" alt="metadata" width='300px'>
<img src="assets/metadata.png" alt="metadata" >

## Adding metadata

You want to make sure that all your data has information describing how you got the data, the meaning of the columns, date of last access, data version, etc. 

For your own use make sure to create at least a `README.txt` file describing the data as best as you can.

Create a `README.txt` (or `.md`) file inside the data directory and add the following content or something similar.

```
Data collected from Kaggle winemag reviews
https://www.kaggle.com/residentmario/renaming-combining-data/data

Collected on 09/05/2018 by Tania Allard
```


Commit both the `README` and the data to the repository.

<h1> Diving deeper into metadata </h1>

What would someone outside your research project (or you in five years' time) need to know about your data to build on your work?

<table>
    <tr>
        <th>What</th>
        <th>Such as..</th>
    </tr>
<tr>
<td>What is the title of the dataset?</td>
    <td>The regulation of emotions in adolescents: age differences
        and emotion-specific patterns</td>
</tr>
 <tr>
    <td>Are there any related publications or data sets?</td>
    <td>Theurel A, Gentaz E (2018) The regulation of emotions in
        adolescents: age differences and emotion-specific patterns.
        PLOS ONE 13(6): e0195501. <a href="https://doi.org/10.1371/journal.pone.0195501">
        https://doi.org/10.1371/journal.pone.0195501</a> </td>
 </tr>
</table>

<table>
<tr>
    <th>Where</th>
    <th>Such as..</th>
</tr>
<tr>
<td>Where was the data collected?</td>
<td>France, Rhône-Alpes</td>
</tr>
<tr>
<td>Where does the data live?</td>
<td>Theurel A, Gentaz E (2018) Data from: The regulation of emotions
    in adolescents: age differences and emotion-specific patterns.
    Dryad Digital Repository. <a href="https://doi.org/10.5061/dryad.n230404">
    https://doi.org/10.5061/dryad.n230404</a> </td>
</tr>
</table>

<table>
     <tr>
         <th>Who</th>
         <th>Such as..</th>
     </tr>
    <tr>
     <td>Who is responsible for the data?</td>
     <td>Dr Theurel, Anne</td>
    </tr>
    <tr>
        <th>When</th>
        <th>Such as..</th>
    </tr>
    <tr>
        <td>When was the data collected? What time span does the data cover?</td>
        <td>Collected: June 2015. Data coverage: 1932-1944</td>
    </tr>
</table>
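The what, where, who, and when questions above can also be captured in a machine-readable record. The sketch below stores the example answers from the tables in a dictionary and serialises it to JSON; the key names (`what`, `collected_in`, `responsible`, and so on) are hypothetical and not part of any standard.

```python
import json

# Answers taken from the example tables above; key names are illustrative only.
metadata = {
    "what": {
        "title": (
            "The regulation of emotions in adolescents: "
            "age differences and emotion-specific patterns"
        ),
        "related_publication": "https://doi.org/10.1371/journal.pone.0195501",
    },
    "where": {
        "collected_in": "France, Rhône-Alpes",
        "repository": "https://doi.org/10.5061/dryad.n230404",
    },
    "who": {"responsible": "Dr Theurel, Anne"},
    "when": {"collected": "June 2015", "coverage": "1932-1944"},
}

REQUIRED_SECTIONS = ("what", "where", "who", "when")


def missing_sections(record):
    """List the required sections that are absent or empty in a record."""
    return [section for section in REQUIRED_SECTIONS if not record.get(section)]


# ensure_ascii=False keeps characters such as "ô" readable in the output.
metadata_json = json.dumps(metadata, indent=2, ensure_ascii=False)
print(metadata_json)
print("Missing sections:", missing_sections(metadata))
```

A check such as `missing_sections` is one possible way to notice an incomplete record before the data is shared.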

Additional resources on schemas and metadata:
- [https://guides.lib.unc.edu/c.php?g=8749&p=44504](https://guides.lib.unc.edu/c.php?g=8749&p=44504)
- [https://www.lib.ncsu.edu/data-management/metadata#best](https://www.lib.ncsu.edu/data-management/metadata#best)
- [http://mozillascience.github.io/checklist/](http://mozillascience.github.io/checklist/)


We are now going to use the [datapackage](https://github.com/frictionlessdata/datapackage-py) package to create some metadata for our dataset.

Let's first remind ourselves what the data looks like:

In [1]:
import pandas as pd
wine = pd.read_csv('solutions/data/raw/winemag-data-130k-v2.csv', index_col=0)
wine.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


The `datapackage` library allows you to work with data packages, so we start by creating a blank data package like so:

In [2]:
import datapackage
package = datapackage.Package()

We can now add useful metadata by adding keys to the package's `descriptor` dictionary. We will start by adding the `name` key and the human-readable `title` key. For a list of the supported keys, check the [DataPackage spec](https://frictionlessdata.io/specs/data-package/#metadata).

In [3]:
package.descriptor['name'] = 'winemag-reviews'
package.descriptor['title'] = 'Winemag wine reviews dataset'
package.descriptor

{'profile': 'data-package',
 'name': 'winemag-reviews',
 'title': 'Winemag wine reviews dataset'}
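Since the descriptor is an ordinary dictionary, further keys from the Data Package spec can be added in the same way. The sketch below works on a plain dictionary, so it runs without `datapackage` installed. The `description` and `keywords` values are illustrative, and the `is_valid_name` helper with its pattern is only one reading of the spec's rule that `name` is lower-case and limited to alphanumerics, `.`, `_`, and `-`.

```python
import re

# A plain dict mirroring the descriptor built above.
descriptor = {
    "profile": "data-package",
    "name": "winemag-reviews",
    "title": "Winemag wine reviews dataset",
}

# Optional keys defined by the Data Package spec; the values are illustrative.
descriptor["description"] = "Wine reviews collected from Kaggle."
descriptor["keywords"] = ["wine", "reviews"]
descriptor["sources"] = [
    {
        "title": "Kaggle wine reviews",
        "path": "https://www.kaggle.com/residentmario/renaming-combining-data/data",
    }
]

# Hypothetical helper: approximates the spec's constraint on `name`.
NAME_PATTERN = re.compile(r"^[a-z0-9._-]+$")


def is_valid_name(name):
    """Return True if the name uses only lower-case letters, digits, '.', '_' or '-'."""
    return bool(NAME_PATTERN.match(name))


print(is_valid_name(descriptor["name"]))
print(is_valid_name("Winemag Reviews"))
```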

## Inferring the data schema
The next step is to infer the data schema and generate additional metadata from our datasets.

In [4]:
# Some path manipulation might be needed... 
import pathlib
import os 

In [5]:
package.infer('./solutions/data/**/*.csv')

{'profile': 'tabular-data-package',
 'resources': [{'path': 'solutions/data/chile.csv',
   'profile': 'tabular-data-resource',
   'name': 'chile',
   'format': 'csv',
   'mediatype': 'text/csv',
   'encoding': 'windows-1252',
   'schema': {'fields': [{'name': 'country',
      'type': 'string',
      'format': 'default'},
     {'name': 'designation', 'type': 'string', 'format': 'default'},
     {'name': 'points', 'type': 'integer', 'format': 'default'},
     {'name': 'price', 'type': 'number', 'format': 'default'},
     {'name': 'price_GBP', 'type': 'number', 'format': 'default'}],
    'missingValues': ['']}},
  {'path': 'solutions/data/raw/winemag-data-130k-v2.csv',
   'profile': 'tabular-data-resource',
   'name': 'winemag-data-130k-v2',
   'format': 'csv',
   'mediatype': 'text/csv',
   'encoding': 'utf-8',
   'schema': {'fields': [{'name': '', 'type': 'integer', 'format': 'default'},
     {'name': 'country', 'type': 'string', 'format': 'default'},
     {'name': 'description', 'type'

In [8]:
len(package.resources)

2

The `infer` method has found all our files and inspected them to extract useful metadata such as the profile, encoding, format, and Table Schema. Let's have a look at the second resource:

In [10]:
package.descriptor['resources'][1]

{'path': 'solutions/data/raw/winemag-data-130k-v2.csv',
 'profile': 'tabular-data-resource',
 'name': 'winemag-data-130k-v2',
 'format': 'csv',
 'mediatype': 'text/csv',
 'encoding': 'utf-8',
 'schema': {'fields': [{'name': '', 'type': 'integer', 'format': 'default'},
   {'name': 'country', 'type': 'string', 'format': 'default'},
   {'name': 'description', 'type': 'string', 'format': 'default'},
   {'name': 'designation', 'type': 'string', 'format': 'default'},
   {'name': 'points', 'type': 'integer', 'format': 'default'},
   {'name': 'price', 'type': 'number', 'format': 'default'},
   {'name': 'province', 'type': 'string', 'format': 'default'},
   {'name': 'region_1', 'type': 'string', 'format': 'default'},
   {'name': 'region_2', 'type': 'string', 'format': 'default'},
   {'name': 'taster_name', 'type': 'string', 'format': 'default'},
   {'name': 'taster_twitter_handle', 'type': 'string', 'format': 'default'},
   {'name': 'title', 'type': 'string', 'format': 'default'},
   {'name': '
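To build some intuition for what schema inference involves, the sketch below guesses a type for each column of a small CSV by trying progressively looser casts. It is not the algorithm used by `datapackage`; the function names `infer_type` and `infer_schema` are hypothetical, and the sample rows reuse a few values from the table shown earlier.

```python
import csv
import io


def infer_type(values, missing_values=("",)):
    """Guess 'integer', 'number' or 'string' for a column of raw strings."""
    present = [value for value in values if value not in missing_values]
    if not present:
        return "string"
    # Try the strictest type first and fall back to looser ones.
    for type_name, caster in (("integer", int), ("number", float)):
        try:
            for value in present:
                caster(value)
        except ValueError:
            continue
        return type_name
    return "string"


def infer_schema(csv_text):
    """Build a Table Schema-like dict from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fields = [
        {
            "name": name,
            "type": infer_type([row[name] for row in rows]),
            "format": "default",
        }
        for name in reader.fieldnames
    ]
    return {"fields": fields, "missingValues": [""]}


sample = "country,points,price\nItaly,87,\nPortugal,87,15.0\nUS,87,14.0\n"
schema = infer_schema(sample)
inferred_types = [field["type"] for field in schema["fields"]]
print(inferred_types)
```

The empty `price` in the first row is treated as a missing value, which is why the column can still be typed as `number`.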

We might want to give this a better name:

In [19]:
package.descriptor['resources'][1]['name'] = 'winemag-reviews'
package.descriptor['resources'][1]

{'path': 'solutions/data/raw/winemag-data-130k-v2.csv',
 'profile': 'tabular-data-resource',
 'name': 'winemag-reviews',
 'format': 'csv',
 'mediatype': 'text/csv',
 'encoding': 'utf-8',
 'schema': {'fields': [{'name': '', 'type': 'integer', 'format': 'default'},
   {'name': 'country', 'type': 'string', 'format': 'default'},
   {'name': 'description', 'type': 'string', 'format': 'default'},
   {'name': 'designation', 'type': 'string', 'format': 'default'},
   {'name': 'points', 'type': 'integer', 'format': 'default'},
   {'name': 'price', 'type': 'number', 'format': 'default'},
   {'name': 'province', 'type': 'string', 'format': 'default'},
   {'name': 'region_1', 'type': 'string', 'format': 'default'},
   {'name': 'region_2', 'type': 'string', 'format': 'default'},
   {'name': 'taster_name', 'type': 'string', 'format': 'default'},
   {'name': 'taster_twitter_handle', 'type': 'string', 'format': 'default'},
   {'name': 'title', 'type': 'string', 'format': 'default'},
   {'name': 'varie
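Indexing resources by position is fragile, because the order depends on how the files were discovered. A possible alternative is to look a resource up by its path before renaming it. The sketch below uses a plain dictionary shaped like the descriptor above; `find_resource` is a hypothetical helper, not part of `datapackage`.

```python
def find_resource(descriptor, path):
    """Return the resource dict whose 'path' matches, or raise KeyError."""
    for resource in descriptor.get("resources", []):
        if resource.get("path") == path:
            return resource
    raise KeyError(path)


# A trimmed-down descriptor with the two resources found by `infer`.
descriptor = {
    "profile": "tabular-data-package",
    "resources": [
        {"path": "solutions/data/chile.csv", "name": "chile"},
        {
            "path": "solutions/data/raw/winemag-data-130k-v2.csv",
            "name": "winemag-data-130k-v2",
        },
    ],
}

# Rename the resource in place, regardless of its position in the list.
resource = find_resource(descriptor, "solutions/data/raw/winemag-data-130k-v2.csv")
resource["name"] = "winemag-reviews"

resource_names = [item["name"] for item in descriptor["resources"]]
print(resource_names)
```

Depending on the library version, in-place edits to `package.descriptor` may need to be applied with `package.commit()` before they take effect, so it is worth checking the `datapackage` documentation.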

Because our resources are tabular, we can read them as tabular data:

In [None]:
package.get_resource('chile').read(keyed=True)
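With `keyed=True`, each row comes back as a dictionary keyed by field name. The sketch below imitates that behaviour with the standard library, casting values according to a schema like the one inferred for `chile.csv`. The sample rows are made up for illustration, `read_keyed` is a hypothetical helper, and the real library may return different Python types (for example `Decimal` for `number`).

```python
import csv
import io

# Map Table Schema type names to Python casters (a simplification).
CASTERS = {"integer": int, "number": float, "string": str}


def read_keyed(csv_text, schema):
    """Return rows as dicts, casting values and mapping missing values to None."""
    types = {field["name"]: field["type"] for field in schema["fields"]}
    missing = set(schema.get("missingValues", [""]))
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        rows.append(
            {
                name: None if value in missing else CASTERS[types[name]](value)
                for name, value in raw.items()
            }
        )
    return rows


schema = {
    "fields": [
        {"name": "country", "type": "string"},
        {"name": "designation", "type": "string"},
        {"name": "points", "type": "integer"},
        {"name": "price", "type": "number"},
        {"name": "price_GBP", "type": "number"},
    ],
    "missingValues": [""],
}

# Made-up rows that follow the column layout of chile.csv.
sample = (
    "country,designation,points,price,price_GBP\n"
    "Chile,Reserva,87,15.0,11.5\n"
    "Chile,,90,,\n"
)
rows = read_keyed(sample, schema)
print(rows[0])
```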

We can now save the contents of the descriptor to a `.json` file and to a zip file. Every resource whose content lives in the local filesystem will be copied into the zip file.
The final structure of the zip file will be:
```
./datapackage.json
./data/local.csv
```

In [15]:
package.save('solutions/data/datapackage.json')

True

In [16]:
package.save('solutions/data/datapackage.zip')

True
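To make the zip layout concrete, the sketch below builds a similar archive by hand with the standard library. It only illustrates the resulting structure and is not how `package.save` is implemented; the file name `local.csv` and its contents are placeholders.

```python
import json
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)

    # A placeholder data file standing in for a local resource.
    data_file = base / "local.csv"
    data_file.write_text("country,points\nChile,87\n")

    descriptor = {
        "profile": "tabular-data-package",
        "name": "winemag-reviews",
        "resources": [{"name": "local", "path": "data/local.csv", "format": "csv"}],
    }

    # Write the descriptor and the data file into one archive.
    archive = base / "datapackage.zip"
    with zipfile.ZipFile(archive, "w") as bundle:
        bundle.writestr("datapackage.json", json.dumps(descriptor, indent=2))
        bundle.write(data_file, arcname="data/local.csv")

    # Read the archive back to confirm its structure.
    with zipfile.ZipFile(archive) as bundle:
        archived_names = sorted(bundle.namelist())
        restored = json.loads(bundle.read("datapackage.json"))

print(archived_names)
```

Keeping the resource paths relative to `datapackage.json` is what lets the archive be unpacked anywhere and still resolve its data files.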

If you want to learn more about the `datapackage` package, visit its GitHub repository: [https://github.com/frictionlessdata/datapackage-py](https://github.com/frictionlessdata/datapackage-py).
