# Student: Gerardo de Miguel González

# Pandas follow-up assignment

## Follow-up exercises: PANDAS

Create a Jupyter notebook reproducing the examples and comments from the 10 Minutes to pandas tutorial (1). The file name for this notebook will be TutorialPandas.ipynb. Also upload the HTML version of the notebook.

---



(1) *Reference*

 - [Tutorial 10min Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)



### 10 Minutes to pandas

You can see more complex recipes in the [Cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook).

In [0]:
#::GMG::So the basic imports are ...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### Object Creation

See the [Data Structure Intro section](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro).

Creating a `Series` by passing a list of values, letting pandas create a default integer index:

In [7]:
s = pd.Series([1,3,5,np.nan,6,8])
print(s)
type(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64


pandas.core.series.Series
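As a minimal sketch of why the `Series` above reports `dtype: float64`: `np.nan` is a float, so a list containing it cannot be stored as integers. The names `s_int` and `s_nan` are illustrative and not part of the tutorial.

```python
import numpy as np
import pandas as pd

# Without missing values, a list of ints is stored with an integer dtype.
s_int = pd.Series([1, 3, 5, 6, 8])

# np.nan is a float, so its presence makes pandas store the values as float64.
s_nan = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s_int.dtype, s_nan.dtype)

# The default index is a range of integers starting at zero.
print(list(s_nan.index))
```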

Creating a `DataFrame` by passing a NumPy array, with a datetime index and labeled columns:

In [8]:
dates = pd.date_range('20130101', periods=6)
print(dates)
type(dates)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')


pandas.core.indexes.datetimes.DatetimeIndex

In [9]:
df = pd.DataFrame(np.random.randn(6,4), 
                  index=dates, 
                  columns=list('ABCD'))
print(df)
type(df)

                   A         B         C         D
2013-01-01 -0.326199 -0.344602 -0.695989  1.083588
2013-01-02 -0.047375  0.084995 -0.631822  0.244230
2013-01-03 -0.248123  0.176343  1.000584 -0.803270
2013-01-04 -1.129952 -0.378938  1.561718  0.855326
2013-01-05 -0.578888  0.266593 -0.075044  1.141016
2013-01-06 -0.660654 -1.417122  0.141583 -0.688961


pandas.core.frame.DataFrame
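Because `np.random.randn` draws new values on every run, the numbers printed above will not reproduce exactly. The following sketch shows one way to get repeatable data, assuming a NumPy version that provides `np.random.default_rng`; the seed `0` and the names `df_seeded` and `df_again` are arbitrary choices, not part of the tutorial.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)

# A seeded generator (the seed 0 is arbitrary) makes the random values repeatable.
rng = np.random.default_rng(0)
df_seeded = pd.DataFrame(rng.standard_normal((6, 4)),
                         index=dates,
                         columns=list('ABCD'))

# Rebuilding with the same seed yields an identical frame.
rng_again = np.random.default_rng(0)
df_again = pd.DataFrame(rng_again.standard_normal((6, 4)),
                        index=dates,
                        columns=list('ABCD'))

print(df_seeded.equals(df_again))
```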

Creating a `DataFrame` by passing a `dict` of objects that can be converted to a series-like structure:

In [10]:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
print(df2)

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo


The *columns* of the resulting `DataFrame` have different `dtypes`.

In [11]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [12]:
df.dtypes

A    float64
B    float64
C    float64
D    float64
dtype: object
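Since each column carries its own dtype, columns can also be picked by type. The sketch below rebuilds `df2` and uses `DataFrame.select_dtypes`; the names `numeric` and `categorical` are illustrative only.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

# Keep only the numeric columns (floats and ints of any width).
numeric = df2.select_dtypes(include='number')
print(list(numeric.columns))

# Keep only the categorical column.
categorical = df2.select_dtypes(include='category')
print(list(categorical.columns))
```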

#### Viewing data

See the [Basics section](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics).

Here is how to view the top and bottom rows of the frame:

In [13]:
df.head()

                   A         B         C         D
2013-01-01 -0.326199 -0.344602 -0.695989  1.083588
2013-01-02 -0.047375  0.084995 -0.631822  0.244230
2013-01-03 -0.248123  0.176343  1.000584 -0.803270
2013-01-04 -1.129952 -0.378938  1.561718  0.855326
2013-01-05 -0.578888  0.266593 -0.075044  1.141016


In [14]:
df.tail(3)

                   A         B         C         D
2013-01-04 -1.129952 -0.378938  1.561718  0.855326
2013-01-05 -0.578888  0.266593 -0.075044  1.141016
2013-01-06 -0.660654 -1.417122  0.141583 -0.688961


Display the index, columns, and the underlying NumPy data:

In [19]:
display(df.index, df2.index)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

Int64Index([0, 1, 2, 3], dtype='int64')

In [20]:
display(df.columns, df2.columns)

Index(['A', 'B', 'C', 'D'], dtype='object')

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

In [21]:
display(df.values, df2.values)

array([[-0.32619948, -0.34460198, -0.69598933,  1.083588  ],
       [-0.04737536,  0.08499546, -0.63182185,  0.24423033],
       [-0.24812276,  0.17634263,  1.00058402, -0.80326973],
       [-1.1299519 , -0.37893831,  1.56171824,  0.85532553],
       [-0.57888782,  0.2665933 , -0.07504442,  1.14101602],
       [-0.66065421, -1.41712208,  0.1415834 , -0.68896116]])

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
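The two arrays above differ in dtype: a frame whose columns are all floats maps to a `float64` array, while mixed columns fall back to a generic `object` array. A small sketch of that behaviour, using the hypothetical frames `df_float` and `df_mixed`:

```python
import numpy as np
import pandas as pd

# All columns share one dtype, so the array keeps it.
df_float = pd.DataFrame(np.zeros((2, 2)), columns=['A', 'B'])

# Floats and strings have no common numeric dtype, so the array holds objects.
df_mixed = pd.DataFrame({'A': [1.0, 2.0], 'F': ['foo', 'bar']})

print(df_float.values.dtype)
print(df_mixed.values.dtype)
```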

[describe()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html#pandas.DataFrame.describe) shows a quick statistical summary of your data:

In [22]:
df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.498532 -0.268788  0.216838  0.305321
std    0.381412  0.623937  0.901920  0.874974
min   -1.129952 -1.417122 -0.695989 -0.803270
25%   -0.640213 -0.370354 -0.492627 -0.455663
50%   -0.452544 -0.129803  0.033269  0.549778
75%   -0.267642  0.153506  0.785834  1.026522
max   -0.047375  0.266593  1.561718  1.141016
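Each row of the summary corresponds to an ordinary column method. The sketch below checks this on a tiny hypothetical frame, `df_small`, whose statistics are easy to verify by hand.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]})
summary = df_small.describe()

# Each row of the summary matches the corresponding column method.
print(summary.loc['mean', 'A'], df_small['A'].mean())
print(summary.loc['std', 'A'], df_small['A'].std())      # sample std (ddof=1)
print(summary.loc['50%', 'A'], df_small['A'].median())
```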


Transposing your data:

In [24]:
print(df.T)
#::GMG::the object itself is not modified
print(df.index)
print(df.columns)

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -0.326199   -0.047375   -0.248123   -1.129952   -0.578888   -0.660654
B   -0.344602    0.084995    0.176343   -0.378938    0.266593   -1.417122
C   -0.695989   -0.631822    1.000584    1.561718   -0.075044    0.141583
D    1.083588    0.244230   -0.803270    0.855326    1.141016   -0.688961
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')


In [25]:
#::GMG::But the transposed object does have its index and columns swapped
print(df.T.index)
print(df.T.columns)

Index(['A', 'B', 'C', 'D'], dtype='object')
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
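The same observation as a compact check: `.T` returns a new object and leaves the original untouched. This sketch uses a small hypothetical frame, `df_demo`, rather than the random `df` above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=3)
df_demo = pd.DataFrame(np.arange(6.0).reshape(3, 2),
                       index=dates,
                       columns=['A', 'B'])

df_t = df_demo.T  # a new object; df_demo keeps its shape

print(df_demo.shape, df_t.shape)

# Transposing twice gives back the original layout.
print(df_t.T.equals(df_demo))
```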


Sorting by an axis:

In [27]:
df

                   A         B         C         D
2013-01-01 -0.326199 -0.344602 -0.695989  1.083588
2013-01-02 -0.047375  0.084995 -0.631822  0.244230
2013-01-03 -0.248123  0.176343  1.000584 -0.803270
2013-01-04 -1.129952 -0.378938  1.561718  0.855326
2013-01-05 -0.578888  0.266593 -0.075044  1.141016
2013-01-06 -0.660654 -1.417122  0.141583 -0.688961


In [31]:
#::GMG::Axes (axis) --> 0 "vertical" (row index), 1 "horizontal" (column labels)
#       What gets sorted here are the "columns", in descending
#       (alphabetical) order
df.sort_index(axis=1, ascending=False)

                   D         C         B         A
2013-01-01  1.083588 -0.695989 -0.344602 -0.326199
2013-01-02  0.244230 -0.631822  0.084995 -0.047375
2013-01-03 -0.803270  1.000584  0.176343 -0.248123
2013-01-04  0.855326  1.561718 -0.378938 -1.129952
2013-01-05  1.141016 -0.075044  0.266593 -0.578888
2013-01-06 -0.688961  0.141583 -1.417122 -0.660654


In [32]:
#::GMG::Here it sorts by the dates, which are the row index
df.sort_index(axis=0, ascending=False)

                   A         B         C         D
2013-01-06 -0.660654 -1.417122  0.141583 -0.688961
2013-01-05 -0.578888  0.266593 -0.075044  1.141016
2013-01-04 -1.129952 -0.378938  1.561718  0.855326
2013-01-03 -0.248123  0.176343  1.000584 -0.803270
2013-01-02 -0.047375  0.084995 -0.631822  0.244230
2013-01-01 -0.326199 -0.344602 -0.695989  1.083588


Sorting by values:

In [33]:
df.sort_values(by='B')

                   A         B         C         D
2013-01-06 -0.660654 -1.417122  0.141583 -0.688961
2013-01-04 -1.129952 -0.378938  1.561718  0.855326
2013-01-01 -0.326199 -0.344602 -0.695989  1.083588
2013-01-02 -0.047375  0.084995 -0.631822  0.244230
2013-01-03 -0.248123  0.176343  1.000584 -0.803270
2013-01-05 -0.578888  0.266593 -0.075044  1.141016


In [34]:
#::GMG::The inline help (<TAB>) has an interesting example with NaNs
df_ex = pd.DataFrame({
          'col1' : ['A', 'A', 'B', np.nan, 'D', 'C'],
          'col2' : [2, 1, 9, 8, 7, 4],
          'col3': [0, 1, 9, 4, 2, 3],
          })
df_ex

  col1  col2  col3
0    A     2     0
1    A     1     1
2    B     9     9
3  NaN     8     4
4    D     7     2
5    C     4     3


In [35]:
#::GMG::Sort by col1, NaN comes last
df_ex.sort_values(by=['col1'])

  col1  col2  col3
0    A     2     0
1    A     1     1
2    B     9     9
5    C     4     3
4    D     7     2
3  NaN     8     4
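Where the missing value lands is controlled by the `na_position` parameter of `sort_values`, which defaults to `'last'`. A short sketch on a reduced copy of `df_ex`; the names `nan_first` and `descending` are illustrative.

```python
import numpy as np
import pandas as pd

df_ex = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
})

# na_position='first' moves the row with the missing value to the top.
nan_first = df_ex.sort_values(by='col1', na_position='first')
print(nan_first)

# Descending order still keeps NaN last unless na_position says otherwise.
descending = df_ex.sort_values(by='col1', ascending=False)
print(descending)
```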


#### Selection

**Note:** While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods: `.at`, `.iat`, `.loc` and `.iloc`.

See the indexing documentation [Indexing and Selecting Data](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing) and [MultiIndex / Advanced Indexing](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced).

**Getting**

Selecting a single column, which yields a Series, equivalent to `df.A`:

In [37]:
df

                   A         B         C         D
2013-01-01 -0.326199 -0.344602 -0.695989  1.083588
2013-01-02 -0.047375  0.084995 -0.631822  0.244230
2013-01-03 -0.248123  0.176343  1.000584 -0.803270
2013-01-04 -1.129952 -0.378938  1.561718  0.855326
2013-01-05 -0.578888  0.266593 -0.075044  1.141016
2013-01-06 -0.660654 -1.417122  0.141583 -0.688961


In [43]:
display(df['A'],df.A)
print('df:', type(df), '\ndf["A"]', type(df['A']))

2013-01-01   -0.326199
2013-01-02   -0.047375
2013-01-03   -0.248123
2013-01-04   -1.129952
2013-01-05   -0.578888
2013-01-06   -0.660654
Freq: D, Name: A, dtype: float64

2013-01-01   -0.326199
2013-01-02   -0.047375
2013-01-03   -0.248123
2013-01-04   -1.129952
2013-01-05   -0.578888
2013-01-06   -0.660654
Freq: D, Name: A, dtype: float64

df: <class 'pandas.core.frame.DataFrame'> 
df["A"] <class 'pandas.core.series.Series'>


Selecting via `[]`, which slices the rows.

In [44]:
#::GMG::both date limits are INCLUDED (!)
#       positional slices start at zero and exclude the stop position
display(df[0:3],df['20130102':'20130104'])

                   A         B         C         D
2013-01-01 -0.326199 -0.344602 -0.695989  1.083588
2013-01-02 -0.047375  0.084995 -0.631822  0.244230
2013-01-03 -0.248123  0.176343  1.000584 -0.803270


                   A         B         C         D
2013-01-02 -0.047375  0.084995 -0.631822  0.244230
2013-01-03 -0.248123  0.176343  1.000584 -0.803270
2013-01-04 -1.129952 -0.378938  1.561718  0.855326
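The difference between the two slices can be checked directly: a positional slice excludes its stop position, while a label slice includes both endpoints. The sketch uses a hypothetical frame, `df_demo`, with deterministic values.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df_demo = pd.DataFrame(np.arange(24.0).reshape(6, 4),
                       index=dates,
                       columns=list('ABCD'))

by_position = df_demo[0:3]                   # rows 0, 1, 2: the stop is excluded
by_label = df_demo['20130102':'20130104']    # both dates are included

print(len(by_position), len(by_label))
print(by_label.index[0], by_label.index[-1])
```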


**Selection by Label**

See more in [Selection by Label](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-label).

For getting a cross section using a label:

In [45]:
df

                   A         B         C         D
2013-01-01 -0.326199 -0.344602 -0.695989  1.083588
2013-01-02 -0.047375  0.084995 -0.631822  0.244230
2013-01-03 -0.248123  0.176343  1.000584 -0.803270
2013-01-04 -1.129952 -0.378938  1.561718  0.855326
2013-01-05 -0.578888  0.266593 -0.075044  1.141016
2013-01-06 -0.660654 -1.417122  0.141583 -0.688961


In [56]:
#::GMG::First row, two ways: .loc with a label gives a Series, a [:1] slice gives a DataFrame (mind the types!)
display(df.loc[dates[0]],df[:1])
display(type(df.loc[dates[0]]),type(df[:1]))

A   -0.326199
B   -0.344602
C   -0.695989
D    1.083588
Name: 2013-01-01 00:00:00, dtype: float64

                   A         B         C         D
2013-01-01 -0.326199 -0.344602 -0.695989  1.083588


pandas.core.series.Series

pandas.core.frame.DataFrame
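The type difference seen above comes from how the row is requested: a single label reduces the result to a `Series`, while a slice or a list of labels keeps it as a `DataFrame`. A minimal sketch, again with a hypothetical `df_demo`:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df_demo = pd.DataFrame(np.arange(24.0).reshape(6, 4),
                       index=dates,
                       columns=list('ABCD'))

row_as_series = df_demo.loc[dates[0]]      # single label -> Series
row_as_frame = df_demo.loc[[dates[0]]]     # list with one label -> DataFrame

print(type(row_as_series).__name__, type(row_as_frame).__name__)
```

Wrapping the label in a list is a convenient way to keep the two-dimensional shape when later code expects a `DataFrame`.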

Selecting on a multi-axis by label:

In [55]:
df.loc[:,['A','B']]

                   A         B
2013-01-01 -0.326199 -0.344602
2013-01-02 -0.047375  0.084995
2013-01-03 -0.248123  0.176343
2013-01-04 -1.129952 -0.378938
2013-01-05 -0.578888  0.266593
2013-01-06 -0.660654 -1.417122


Showing label slicing, both endpoints are *included*:

In [59]:
display(df.loc['20130102':'20130104',['A','B']],
        type(df.loc['20130102':'20130104',['A','B']]))

                   A         B
2013-01-02 -0.047375  0.084995
2013-01-03 -0.248123  0.176343
2013-01-04 -1.129952 -0.378938


pandas.core.frame.DataFrame

Reduction in the dimensions of the returned object:

In [58]:
display(df.loc['20130102',['A','B']],type(df.loc['20130102',['A','B']]))

A   -0.047375
B    0.084995
Name: 2013-01-02 00:00:00, dtype: float64

pandas.core.series.Series

For getting a scalar value:

In [60]:
display(df.loc[dates[0],'A'],type(df.loc[dates[0],'A']))

-0.32619947842139757

numpy.float64

For getting fast access to a scalar (equivalent to the prior method):

In [61]:
display(df.at[dates[0],'A'], type(df.at[dates[0],'A']))

-0.32619947842139757

numpy.float64
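As a quick check that the scalar accessors agree, the sketch below reads the same cell with `.loc`, `.at` and the positional `.iat` mentioned in the note at the start of this section. The frame `df_demo` is a hypothetical stand-in with known values.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df_demo = pd.DataFrame(np.arange(24.0).reshape(6, 4),
                       index=dates,
                       columns=list('ABCD'))

value_loc = df_demo.loc[dates[0], 'A']   # label-based, general purpose
value_at = df_demo.at[dates[0], 'A']     # label-based, scalar only
value_iat = df_demo.iat[0, 0]            # position-based, scalar only

print(value_loc == value_at == value_iat)
```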