# Importing your data

To import your data, use pandas. You can also load one of the classic geostatistical datasets, such as Walker Lake and Jura, from our datasets module.

![pandas](figs/pandas1.png)

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. 

The two primary data structures of pandas, ``Series`` (1-dimensional) and ``DataFrame`` (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

To import Pandas:

In [1]:
import pandas as pd

For additional information on pandas, see the [documentation](https://pandas.pydata.org/pandas-docs/stable/).
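As a minimal sketch of the two data structures described above, the block below builds a ``Series`` and a ``DataFrame`` by hand. The names ``grades`` and ``samples`` and the values in them are hypothetical and only serve as an illustration.

```python
import pandas as pd

# A Series is a 1-dimensional labeled array (hypothetical values)
grades = pd.Series([0.0, 224.4, 434.4], name="V")

# A DataFrame is a 2-dimensional table whose columns are Series
samples = pd.DataFrame({
    "X": [11.0, 9.0, 8.0],
    "Y": [8.0, 48.0, 69.0],
    "V": [0.0, 224.4, 434.4],
})

print(grades.ndim, samples.ndim)  # number of dimensions of each structure
print(samples.shape)              # (rows, columns)
print(type(samples["V"]))         # selecting one column gives a Series
```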

## Getting Data In/Out

pandas comes with a set of I/O functions. Reader functions, which return a pandas object, are accessed like ``pd.read_fileformat()``, while writer functions are methods of the object and are accessed like ``DataFrame.to_fileformat()``.

| Format type | Data description                 | Reader         | Writer       |
|-------------|----------------------------------|----------------|--------------|
| text        | CSV                              | read_csv       | to_csv       |
| text        | JSON                             | read_json      | to_json      |
| text        | HTML                             | read_html      | to_html      |
| text        | Local clipboard                  | read_clipboard | to_clipboard |
| binary      | MS Excel                         | read_excel     | to_excel     |
| binary      | HDF5 Format                      | read_hdf       | to_hdf       |
| binary      | Feather Format                   | read_feather   | to_feather   |
| binary      | Parquet Format                   | read_parquet   | to_parquet   |
| binary      | Msgpack (removed in pandas 1.0)  | read_msgpack   | to_msgpack   |
| binary      | Stata                            | read_stata     | to_stata     |
| binary      | SAS                              | read_sas       |              |
| binary      | Python Pickle Format             | read_pickle    | to_pickle    |
| SQL         | SQL                              | read_sql       | to_sql       |
| SQL         | Google BigQuery                  | read_gbq       | to_gbq       |
Importing a ``.csv`` dataset:

In [2]:
path = "data/"
file = "walker.csv"

data = pd.read_csv(path + file, sep=",", na_values=-999)
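To illustrate what ``na_values`` does in the call above, here is a small sketch that reads the same CSV text with and without it. The CSV content is hypothetical and assumes, as is common in geostatistical files, that -999 marks a missing measurement.

```python
import io
import pandas as pd

# Hypothetical CSV text where -999 marks a missing measurement
csv_text = "Id,X,Y,V,U\n1,11,8,0.0,-999\n2,8,30,224.4,15.6\n"

# Without na_values, -999 is read as an ordinary number
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, every -999 is converted to NaN while reading
clean = pd.read_csv(io.StringIO(csv_text), na_values=-999)

print(raw["U"].tolist())
print(clean["U"].isna().tolist())
```

Marking the placeholder as missing at import time keeps it out of later statistics such as the mean or the correlation.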

## Viewing Data

``df.head(n)`` shows the first n rows of the DataFrame (n=5 if nothing is passed), while ``df.tail(n)`` shows the last n rows.

In [3]:
data.head()

Unnamed: 0,Id,X,Y,V,U,T
0,1.0,11.0,8.0,0.0,,2.0
1,2.0,8.0,30.0,0.0,,2.0
2,3.0,9.0,48.0,224.4,,2.0
3,4.0,8.0,69.0,434.4,,2.0
4,5.0,9.0,90.0,412.1,,2.0
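A short sketch of ``head`` and ``tail`` on a hypothetical ten-row frame (``df`` is not the Walker Lake data):

```python
import pandas as pd

# Hypothetical frame with ten rows
df = pd.DataFrame({"V": range(10)})

first_three = df.head(3)   # first three rows
last_two = df.tail(2)      # last two rows
default_head = df.head()   # n defaults to 5

print(len(first_three), len(last_two), len(default_head))
```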


## Filtering

Accessing variable columns:

In [4]:
data[["U", "V"]].head()

Unnamed: 0,U,V
0,,0.0
1,,0.0
2,,224.4
3,,434.4
4,,412.1


Selecting the rows where variable V is greater than 640:

In [5]:
df_filter = data['V'] > 640

data[df_filter].head()

Unnamed: 0,Id,X,Y,V,U,T
18,19.0,31.0,68.0,895.2,,2.0
19,20.0,28.0,88.0,702.6,,2.0
30,31.0,49.0,11.0,653.3,,2.0
34,35.0,50.0,88.0,820.8,,2.0
37,38.0,49.0,151.0,773.3,,2.0


Selecting the rows where variable T is equal to 1:

In [6]:
df_filter = data["T"] == 1

data[df_filter].head()

Unnamed: 0,Id,X,Y,V,U,T
11,12.0,10.0,231.0,82.1,,1.0
12,13.0,11.0,250.0,81.1,,1.0
44,45.0,51.0,290.0,159.6,,1.0
55,56.0,69.0,208.0,97.4,,1.0
56,57.0,69.0,229.0,0.0,,1.0
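The two filters above can also be combined. The following sketch assumes a small hypothetical frame with columns named like those of the Walker Lake data; ``DataFrame.query`` is shown as an alternative that is not used elsewhere in this tutorial.

```python
import pandas as pd

# Hypothetical values, for illustration only
df = pd.DataFrame({
    "V": [895.2, 702.6, 82.1, 0.0],
    "T": [2.0, 2.0, 1.0, 1.0],
})

# Each comparison needs its own parentheses; use & (and), | (or), ~ (not)
high_v_type2 = df[(df["V"] > 640) & (df["T"] == 2)]
low_or_type1 = df[(df["V"] < 100) | (df["T"] == 1)]

# DataFrame.query expresses the first filter as a string
same_as_first = df.query("V > 640 and T == 2")

print(high_v_type2)
print(low_or_type1)
```

Python's ``and`` and ``or`` keywords do not work element-wise on a ``Series``, which is why the bitwise operators are used here.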


``describe()`` shows a quick statistical summary of your data:

In [7]:
data[["U", "V"]].describe()

Unnamed: 0,U,V
count,275.0,470.0
mean,604.081091,435.298723
std,767.40562,299.882302
min,0.0,0.0
25%,82.15,184.6
50%,319.3,424.0
75%,844.55,640.85
max,5190.1,1528.1


``corr()`` shows the correlation matrix:

In [8]:
data[["U", "V"]].corr()

Unnamed: 0,U,V
U,1.0,0.551482
V,0.551482,1.0
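Note that the ``count`` row of ``describe()`` differs between U and V because missing values are skipped. The sketch below, on a hypothetical frame, shows that ``describe()`` counts only non-missing values and that ``corr()`` uses only the rows where both variables are present.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where U is missing in the first row
df = pd.DataFrame({
    "U": [np.nan, 1.0, 2.0, 3.0],
    "V": [5.0, 2.0, 4.0, 6.0],
})

counts = df.describe().loc["count"]  # non-missing values per column
r = df.corr().loc["U", "V"]          # computed from the three complete rows only

print(counts)
print(r)
```

In the three complete rows V is exactly twice U, so the first row (where U is missing) has no influence on the correlation.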


## Selection

Selecting a single column, which yields a ``Series``:

In [9]:
V_variable = data.V

# which is the same as:
V_variable = data["V"]

In [10]:
type(V_variable)

pandas.core.series.Series

To select values:

[df.at](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html) can only access a single value at a time.

[df.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) can select multiple rows and/or columns.

In [11]:
data.loc[[3, 4, 5]]  # rows with index labels 3, 4 and 5, all columns

Unnamed: 0,Id,X,Y,V,U,T
3,4.0,8.0,69.0,434.4,,2.0
4,5.0,9.0,90.0,412.1,,2.0
5,6.0,10.0,110.0,587.2,,2.0


In [12]:
data.at[4, "V"]  # value of variable V in the row with index label 4

412.1
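Both ``df.loc`` and ``df.at`` select by index label, not by position. The sketch below uses a hypothetical frame whose labels start at 3 to make the difference visible; ``df.iloc``, which selects by position, is not covered above and is included only for contrast.

```python
import pandas as pd

# Hypothetical frame whose index labels do not start at 0
df = pd.DataFrame(
    {"V": [434.4, 412.1, 587.2], "T": [2.0, 2.0, 2.0]},
    index=[3, 4, 5],
)

by_label = df.loc[[3, 4], ["V"]]  # rows labeled 3 and 4, column V only
sliced = df.loc[3:4]              # label slices include the end label
single = df.at[4, "V"]            # one scalar, selected by label
by_position = df.iloc[0]          # first row, whatever its label is

print(by_label)
print(single)
print(by_position.name)           # label of the first row
```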

## Setting

First, let's create a capped U variable:

In [13]:
import numpy as np

In [14]:
U_cap = np.where(data["U"] > 2535, 2535, data["U"])

Now let's add it to the DataFrame as a new column named ``U capped``:

In [15]:
data["U capped"] = U_cap

In [16]:
data.tail()

Unnamed: 0,Id,X,Y,V,U,T,U capped
465,466.0,214.0,19.0,242.5,15.6,2.0,15.6
466,467.0,245.0,231.0,161.2,26.1,2.0,26.1
467,468.0,233.0,220.0,626.0,959.7,2.0,959.7
468,469.0,226.0,221.0,800.1,1681.5,2.0,1681.5
469,470.0,213.0,218.0,482.6,476.2,2.0,476.2
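As an alternative sketch, the same capping can be written with ``Series.clip``. The series ``u`` below is hypothetical; the cap of 2535 is the one used above. Both approaches leave missing values as ``NaN``.

```python
import numpy as np
import pandas as pd

# Hypothetical values, including one missing measurement
u = pd.Series([15.6, 2600.0, np.nan, 5190.1], name="U")
cap = 2535

# np.where returns a plain NumPy array, so the index is re-attached here
capped_where = pd.Series(np.where(u > cap, cap, u), index=u.index)

# Series.clip keeps the index and the name of the original Series
capped_clip = u.clip(upper=cap)

print(capped_where.tolist())
print(capped_clip.tolist())
```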


``df.loc`` and ``df.at`` can be used to set values too:

In [17]:
data.at[4, "V"] = 0

In [18]:
data.at[4, "V"]

0.0
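While ``df.at`` sets one cell at a time, ``df.loc`` can set many cells at once. A minimal sketch on a hypothetical frame (the values and the threshold of 640 are for illustration only):

```python
import pandas as pd

# Hypothetical frame, for illustration only
df = pd.DataFrame({"V": [0.0, 224.4, 895.2], "T": [2.0, 2.0, 1.0]})

df.at[1, "V"] = 200.0               # a single cell
df.loc[df["V"] > 640, "V"] = 640.0  # every row matching the boolean mask
df.loc[0, ["V", "T"]] = [1.0, 1.0]  # several columns of one row

print(df)
```

These assignments modify ``df`` in place, so keep a copy (``df.copy()``) if the original values are still needed.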

## Missing data

To drop every row that has missing data, use ``df.dropna()``. This is especially useful for extracting an isotopic dataset, that is, one in which every sample has a value for all variables.

In [19]:
data.dropna().head()

Unnamed: 0,Id,X,Y,V,U,T,U capped
195,196.0,40.0,71.0,76.2,1.1,2.0,1.1
196,197.0,21.0,69.0,284.3,7.8,2.0,7.8
197,198.0,28.0,80.0,606.8,105.3,2.0,105.3
198,199.0,29.0,59.0,772.7,1512.7,2.0,1512.7
199,200.0,41.0,81.0,269.5,9.8,2.0,9.8
