#  From Raw to Technically Correct Data
*(from de Jonge and van der Loo)*

A data set is a collection of data that describes attribute values (variables) of a number of real-world objects (units). By *technically correct* data, we mean a data set in which each value:

1. can be directly recognized as belonging to a certain variable, and
2. is stored in a data type that represents the value domain of the real-world variable.

This means that, for each unit, a text variable should be stored as text, a numeric variable as a number, and so on, all in a format that is consistent across the data set and with appropriate variable (column) names. I will not cover categorical variables here, since they would require the use of labels; for further reading on that topic, see http://pandas.pydata.org/pandas-docs/stable/categorical.html
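As a minimal sketch of what "stored in a suitable data type" means in practice, the snippet below starts from a small, made-up table in which every value is text and converts the numeric columns with `pd.to_numeric`. The column names and values are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical raw data: every value arrives as text.
raw = pd.DataFrame({
    "name": ["individuum 1", "individuum 2"],
    "age": ["42", "37"],
    "weight": ["52.9", "87.0"],
})

# Convert each numeric column to a type matching its real-world domain;
# the text column is left untouched.
tidy = raw.assign(
    age=pd.to_numeric(raw["age"]),
    weight=pd.to_numeric(raw["weight"]),
)
print(tidy.dtypes)
```

After the conversion, `age` holds integers and `weight` holds floats, so arithmetic such as `tidy["age"].mean()` behaves as expected.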

To sum up: the goal is to find ways to read from a source of text or data (e.g. a text file) and transform it into a Pandas DataFrame with suitable column names.

## Reading text data to DataFrame

We have already seen in previous notes that Pandas provides functions that take a file as input and return a DataFrame as output. This is the most suitable way of reading text in Pandas; however, we can always use Python's csv library instead and convert the data to a DataFrame using any of the DataFrame constructors.
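The csv-library route could look roughly like the sketch below. It reads from an in-memory string instead of a file on disk, and the sample contents and the name `df_manual` are assumptions made for illustration only.

```python
import csv
import io

import pandas as pd

# Hypothetical in-memory sample standing in for a file on disk.
sample = io.StringIO(
    "name,age,weight\n"
    "individuum 1,42,52.9\n"
    "individuum 2,37,87.0\n"
)

# csv.reader yields each row as a list of strings.
rows = list(csv.reader(sample))
header, records = rows[0], rows[1:]

# Build the DataFrame from the list of rows, then convert types explicitly,
# because the csv module never infers them.
df_manual = pd.DataFrame(records, columns=header)
df_manual["age"] = pd.to_numeric(df_manual["age"])
df_manual["weight"] = pd.to_numeric(df_manual["weight"])
print(df_manual.dtypes)
```

Note that this approach leaves type conversion entirely to us, which is one reason the Pandas readers are usually more convenient.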

We will cover some of the most common uses of the Pandas functions for reading the CSV file format. For other file formats and further options, please refer to http://pandas.pydata.org/pandas-docs/stable/io.html.

### CSV reading
Nowadays, one of the most common data formats is CSV: tabular data files that use a comma as the separator between variable values.


In [4]:
fname = "../data/people.csv"
with open(fname) as f:
    content = f.readlines()
print(content[:5])

[',Age[years],Sex,Weight[kg],Eye Color,Body Temperature[C]\n', 'individuum 1, 42, female, 52.9, brown, 36.9\n', 'individuum 2, 37, male, 87.0, green, 36.3\n', 'individuum 3, 29, male, 82.1, blue, 36.4\n', 'individuum 4, 61, female, 62.5, blue, 36.7\n']


In [14]:
import pandas as pd
from IPython.display import display, HTML

path_to_file = "../data/people.csv"
df = pd.read_csv(path_to_file)
display(df)
print("Types:")
display(df.dtypes)

| | Unnamed: 0 | Age[years] | Sex | Weight[kg] | Eye Color | Body Temperature[C] |
|---|---|---|---|---|---|---|
| 0 | individuum 1 | 42 | female | 52.9 | brown | 36.9 |
| 1 | individuum 2 | 37 | male | 87.0 | green | 36.3 |
| 2 | individuum 3 | 29 | male | 82.1 | blue | 36.4 |
| 3 | individuum 4 | 61 | female | 62.5 | blue | 36.7 |
| 4 | individuum 5 | 77 | female | 55.5 | gray | 36.6 |
| 5 | individuum 6 | 33 | male | 95.2 | green | 36.5 |
| 6 | individuum 7 | 32 | female | 81.8 | brown | 37.0 |
| 7 | individuum 8 | 45 | male | 78.9 | brown | 36.3 |
| 8 | individuum 9 | 18 | male | 83.4 | green | 36.6 |
| 9 | individuum 10 | 19 | male | 84.7 | gray | 36.1 |


Types:


Unnamed: 0              object
Age[years]               int64
Sex                     object
Weight[kg]             float64
Eye Color               object
Body Temperature[C]    float64
dtype: object

Some things to note about `read_csv`:
* it recognizes the first line as column names
* it recognizes variable types
* it automatically assigns a row index
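These observations can be checked programmatically. The sketch below uses an in-memory copy of the first lines of the file shown earlier, so it runs without `../data/people.csv`; the name `df_check` is an assumption made for illustration.

```python
import io

import pandas as pd

# In-memory copy of the first lines of people.csv, as printed earlier.
sample = io.StringIO(
    ",Age[years],Sex,Weight[kg],Eye Color,Body Temperature[C]\n"
    "individuum 1, 42, female, 52.9, brown, 36.9\n"
    "individuum 2, 37, male, 87.0, green, 36.3\n"
)
df_check = pd.read_csv(sample)

print(list(df_check.columns))    # the first line became the column names
print(df_check.dtypes)           # numeric columns were inferred
print(df_check.index)            # a default integer row index was assigned
print(repr(df_check["Sex"][0]))  # text values keep the space after each comma
```

Two details are worth noticing: the first header field is empty, so Pandas names that column `Unnamed: 0`, and with default settings the text columns retain the leading space that follows each comma in the raw file.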

Useful options of read_csv: