# Session 4 - Acquiring data

## Session agenda
1. How data is stored.
2. Basic file formats: csv, json, hdf5.
3. Working with different formats in Python.
4. Text encodings, Python’s standard library codecs module. Encoding and decoding data.
5. Where to get data: data repositories (UCI Machine Learning Repository, sklearn datasets and others).

## Discussion - how is data stored?
Let us discuss the following points:
1. What is important when we talk about storing data?
2. What features of data affect its storage?
3. Why is there no universal data storage format?
4. Give examples where the data storage format significantly affects the task being performed.

## Basic file formats
I am sure you have encountered a significant number of data storage formats. Let us review a few of them.

### Comma-separated values (CSV)
CSV is the most popular way to store tabular data. Any other separator can be used instead of a comma (e.g. tab, white space, dot, colon, etc.).

Reference:
1. https://en.wikipedia.org/wiki/Comma-separated_values
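
A minimal sketch of reading delimited text with the standard library `csv` module. The table below is made up for illustration and held in memory; the same rows are parsed once with a comma and once with a tab as the separator.

```python
import csv
from io import StringIO

# Hypothetical sample data: the same table with two different separators.
comma_text = "name,age\nAlice,30\nBob,25\n"
tab_text = "name\tage\nAlice\t30\nBob\t25\n"


def read_rows(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


comma_rows = read_rows(comma_text, ",")
tab_rows = read_rows(tab_text, "\t")

print(comma_rows)
print(comma_rows == tab_rows)  # both separators describe the same table
```

Note that the `csv` module returns every value as a string, so numeric columns have to be converted explicitly.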

### JavaScript Object Notation (JSON)
A very popular format for representing data in the form of attribute-value pairs. It is commonly used for transmitting data objects between a browser and a server. It can also be used to store hierarchical data.

Let us check an example:
```json
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    },
    {
      "type": "mobile",
      "number": "123 456-7890"
    }
  ],
  "children": [],
  "spouse": null
}
```

Python has a module for working with the JSON data format in the standard library (`json`).

Reference:
1. https://en.wikipedia.org/wiki/JSON

In [None]:
import json
# Encoding: Python objects -> JSON text
a = ['foo', {'bar': ('baz', None, 1.0, 2)}]
print(type(a), type(a[1]))
print(a)
print(json.dumps(a))  # the tuple becomes a JSON array, None becomes null
print(json.dumps({"c": 0, "b": 0, "a": 0}, sort_keys=True))  # {"a": 0, "b": 0, "c": 0}
print(json.dumps([1, 2, 3, {'4': 5, '6': 7}], separators=(',', ':')))  # compact output
print(json.dumps({'4': 5, '6': 7}, sort_keys=True, indent=4))  # pretty-printed output

# Decoding: JSON text -> Python objects
b = json.loads('["foo", {"bar":["baz", null, 1.0, 2]}]')
print(b)
print(type(b), type(b[1]))

# Streaming API: dump/load work with file-like objects
from io import StringIO
buffer = StringIO()
json.dump(['streaming API'], buffer)
print(buffer.getvalue())

buffer = StringIO('["streaming API"]')
print(json.load(buffer))
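
As a sketch of how the hierarchical record shown earlier in this section can be navigated after decoding, the snippet below parses a shortened copy of that example and reads some nested fields. The variable names are illustrative.

```python
import json

# Shortened copy of the example record from this section.
record_text = """
{
  "firstName": "John",
  "lastName": "Smith",
  "address": {"city": "New York", "state": "NY"},
  "phoneNumbers": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "office", "number": "646 555-4567"}
  ],
  "children": [],
  "spouse": null
}
"""

record = json.loads(record_text)

# JSON objects become dicts, arrays become lists, null becomes None.
city = record["address"]["city"]
phone_by_type = {p["type"]: p["number"] for p in record["phoneNumbers"]}

print(city)
print(phone_by_type["office"])
print(record["spouse"] is None)
```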


## Encoding and decoding data
When working with data that is sent over the internet, you often need to encode or decode it: the network carries raw bytes, and a text encoding defines how those bytes map to characters.

In [None]:
import urllib.request

url = "http://mks2.cs.msu.ru/"
with urllib.request.urlopen(url) as response:
    encoded = response.read()  # raw bytes as received from the server
    print(encoded)
    # Charset declared in the Content-Type header.
    # Assumption: fall back to UTF-8 when the server does not declare one.
    response_encoding = response.headers.get_content_charset('utf-8')
    print('After decoding:\n')
    decoded = encoded.decode(encoding=response_encoding)
    print(decoded)
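
A small sketch of the encode/decode round trip itself, independent of any network request. The sample string is made up; `str.encode`, `bytes.decode` and `codecs.lookup` are standard library calls.

```python
import codecs

text = "naïve café"  # sample string with two non-ASCII characters

utf8_bytes = text.encode("utf-8")      # 'ï' and 'é' take two bytes each
latin1_bytes = text.encode("latin-1")  # one byte per character

print(len(text), len(utf8_bytes), len(latin1_bytes))

# Decoding with the matching codec restores the original text.
assert utf8_bytes.decode("utf-8") == text

# Decoding with the wrong codec either garbles the text or fails.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as error:
    print("cannot decode:", error.reason)

# codecs.lookup resolves aliases to the canonical codec name.
print(codecs.lookup("UTF8").name)
```

This is why the charset reported by the server matters: the same bytes produce different text, or an error, depending on the codec used to decode them.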

## Hierarchical Data Format (HDF)
Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data. Originally developed at the National Center for Supercomputing Applications, it is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.

The current version, HDF5, differs significantly in design and API from the major legacy version HDF4. HDF5 simplifies the file structure to include only two major types of object:

* Datasets, which are multidimensional arrays of a homogeneous type
* Groups, which are container structures which can hold datasets and other groups

Working with HDF files in Python usually requires additional packages. A number of formats are built on top of HDF5, and they typically need their own libraries.
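
To make the dataset/group model concrete, here is a minimal sketch using the third-party `h5py` package. This is an assumption on my part: `h5py` is not the library used in this session and has to be installed separately, together with NumPy. The file lives in memory only, and the group and dataset names are invented for illustration.

```python
import h5py
import numpy as np

names = []

# driver="core" with backing_store=False keeps the file in memory,
# so nothing is written to disk. The file name is only a label here.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    # A group is a container, similar to a directory.
    group = f.create_group("experiment")

    # A dataset is a multidimensional array of one homogeneous type.
    data = np.arange(6, dtype="int32").reshape(2, 3)
    dataset = group.create_dataset("signal", data=data)

    # Objects are addressed with path-like names.
    shape = f["experiment/signal"].shape
    total = int(f["experiment/signal"][...].sum())

    # Walk the hierarchy and collect every object name.
    f.visit(names.append)

print(shape, total)
print(names)
```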

We will be using http://unidata.github.io/netcdf4-python/ for our next example. Let us explore the ".CDF" files in the "Data/LS_MS Data CDF" folder.

## Where to get data
There are numerous ways to get data. Let us discuss the most typical cases:

1. Raw data - data collected from an experiment, a survey, automated monitoring of a process, etc.
2. Data repositories - websites that host preprocessed data, stored in common data storage formats and ready to be used for data analysis.
   1. UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html)
   2. Kaggle (https://www.kaggle.com/datasets)
   3. scikit-learn datasets (http://scikit-learn.org/stable/datasets/)
   4. Field-specific or application-specific repositories - search the web and you will find many options

3. Databases - relational (SQL) and non-relational (NoSQL) databases.

We can use an IPython notebook to load data directly from a repository. Check the next example.

In [None]:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = pd.read_csv(url, names=names)
iris.head(5)
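
For the database case mentioned in the list above, a minimal sketch with the standard library `sqlite3` module is given below. The table name, columns and rows are illustrative only, and an in-memory database stands in for a real one.

```python
import sqlite3

# ":memory:" creates a temporary database that disappears on close.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE flowers (species TEXT, petal_length REAL)"
)
connection.executemany(
    "INSERT INTO flowers VALUES (?, ?)",
    [("setosa", 1.4), ("setosa", 1.3), ("virginica", 6.0)],
)

# Aggregate in SQL and fetch the result as a list of tuples.
rows = connection.execute(
    "SELECT species, COUNT(*) FROM flowers "
    "GROUP BY species ORDER BY species"
).fetchall()
connection.close()

print(rows)
```

The query result is a plain list of tuples, which can be passed on to other tools for analysis.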