# Student: Gerardo de Miguel González

# Pandas follow-up assignment

## Follow-up exercises: PANDAS

Create a Jupyter notebook reproducing the examples and comments from the 10-minute pandas tutorial (1). The file name for this notebook will be TutorialPandas.ipynb. Also upload the HTML version of the notebook.

---



(1) *Reference*

 - [Tutorial 10min Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)



### 10 Minutes to pandas

You can see more complex recipes in the [Cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook).

In [0]:
#::GMG::So the basic imports are ...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### Object Creation

See the [Data Structure Intro section](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro).

Creating a `Series` by passing a list of values, letting pandas create a default integer index:

In [0]:
s = pd.Series([1,3,5,np.nan,6,8])
print(s)
type(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64


pandas.core.series.Series

Creating a `DataFrame` by passing a NumPy array, with a datetime index and labeled columns:

In [2]:
dates = pd.date_range('20130101', periods=6)
print(dates)
type(dates)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')


pandas.core.indexes.datetimes.DatetimeIndex

In [3]:
df = pd.DataFrame(np.random.randn(6,4), 
                  index=dates, 
                  columns=list('ABCD'))
print(df)
type(df)

                   A         B         C         D
2013-01-01 -1.061310  0.713476  0.244359  1.169532
2013-01-02  1.679106  0.677066 -1.083542 -0.396500
2013-01-03  0.277496 -0.297332 -0.394482  0.228910
2013-01-04 -0.055822 -2.341315 -0.102868 -0.993303
2013-01-05  1.733123 -0.567087  1.950235 -0.662319
2013-01-06  1.644846  0.810954  0.235825  0.156586


pandas.core.frame.DataFrame

Creating a `DataFrame` by passing a `dict` of objects that can be converted to a series-like structure:

In [4]:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
print(df2)

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo


The *columns* of the resulting `DataFrame` have different `dtypes`.

In [0]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [0]:
df.dtypes

A    float64
B    float64
C    float64
D    float64
dtype: object
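As a small sketch of why the columns end up with different dtypes: each value in the dict is converted on its own, and a scalar is broadcast to the length of the other columns. The names `frame` and `dtype_names` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Each dict value becomes one column; the scalar 1.0 is repeated on every row.
frame = pd.DataFrame({
    "x": 1.0,
    "y": pd.Series([10, 20, 30], dtype="int32"),
    "z": pd.Categorical(["a", "b", "a"]),
})

# Map each column name to the name of its dtype.
dtype_names = frame.dtypes.astype(str).to_dict()
print(dtype_names)
```

Each column keeps the dtype of the object it was built from, which is why `df2.dtypes` is so varied while `df.dtypes` is uniformly `float64`.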

#### Viewing data

See the [Basics section](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics).

Here is how to view the top and bottom rows of the frame:

In [0]:
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,-0.326199,-0.344602,-0.695989,1.083588
2013-01-02,-0.047375,0.084995,-0.631822,0.24423
2013-01-03,-0.248123,0.176343,1.000584,-0.80327
2013-01-04,-1.129952,-0.378938,1.561718,0.855326
2013-01-05,-0.578888,0.266593,-0.075044,1.141016


In [0]:
df.tail(3)

Unnamed: 0,A,B,C,D
2013-01-04,-1.129952,-0.378938,1.561718,0.855326
2013-01-05,-0.578888,0.266593,-0.075044,1.141016
2013-01-06,-0.660654,-1.417122,0.141583,-0.688961


Display the index, columns, and the underlying NumPy data:

In [0]:
display(df.index, df2.index)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

Int64Index([0, 1, 2, 3], dtype='int64')

In [0]:
display(df.columns, df2.columns)

Index(['A', 'B', 'C', 'D'], dtype='object')

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

In [0]:
display(df.values, df2.values)

array([[-0.32619948, -0.34460198, -0.69598933,  1.083588  ],
       [-0.04737536,  0.08499546, -0.63182185,  0.24423033],
       [-0.24812276,  0.17634263,  1.00058402, -0.80326973],
       [-1.1299519 , -0.37893831,  1.56171824,  0.85532553],
       [-0.57888782,  0.2665933 , -0.07504442,  1.14101602],
       [-0.66065421, -1.41712208,  0.1415834 , -0.68896116]])

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

[describe()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html#pandas.DataFrame.describe) shows a quick statistic summary of your data:

In [0]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.498532,-0.268788,0.216838,0.305321
std,0.381412,0.623937,0.90192,0.874974
min,-1.129952,-1.417122,-0.695989,-0.80327
25%,-0.640213,-0.370354,-0.492627,-0.455663
50%,-0.452544,-0.129803,0.033269,0.549778
75%,-0.267642,0.153506,0.785834,1.026522
max,-0.047375,0.266593,1.561718,1.141016


Transposing your data:

In [0]:
print(df.T)
#::GMG::the original object is not modified
print(df.index)
print(df.columns)

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -0.326199   -0.047375   -0.248123   -1.129952   -0.578888   -0.660654
B   -0.344602    0.084995    0.176343   -0.378938    0.266593   -1.417122
C   -0.695989   -0.631822    1.000584    1.561718   -0.075044    0.141583
D    1.083588    0.244230   -0.803270    0.855326    1.141016   -0.688961
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')


In [0]:
#::GMG::But the transposed frame does have its index and columns swapped
print(df.T.index)
print(df.T.columns)

Index(['A', 'B', 'C', 'D'], dtype='object')
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
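The same behaviour can be checked on a tiny deterministic frame. This is only a sketch with made-up labels (`r1`, `r2`): `.T` returns a new object and leaves the original untouched.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["r1", "r2"])

# .T returns a new DataFrame with rows and columns swapped.
transposed = frame.T

original_columns = list(frame.columns)         # still the original column labels
transposed_columns = list(transposed.columns)  # the former row labels
print(original_columns, transposed_columns)
```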


Sorting by an axis:

In [0]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.326199,-0.344602,-0.695989,1.083588
2013-01-02,-0.047375,0.084995,-0.631822,0.24423
2013-01-03,-0.248123,0.176343,1.000584,-0.80327
2013-01-04,-1.129952,-0.378938,1.561718,0.855326
2013-01-05,-0.578888,0.266593,-0.075044,1.141016
2013-01-06,-0.660654,-1.417122,0.141583,-0.688961


In [0]:
#::GMG::Axes (axis) --> 0 "vertical" (row labels), 1 "horizontal" (column labels)
#       Here the columns are sorted in descending
#       (reverse alphabetical) order
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2013-01-01,1.083588,-0.695989,-0.344602,-0.326199
2013-01-02,0.24423,-0.631822,0.084995,-0.047375
2013-01-03,-0.80327,1.000584,0.176343,-0.248123
2013-01-04,0.855326,1.561718,-0.378938,-1.129952
2013-01-05,1.141016,-0.075044,0.266593,-0.578888
2013-01-06,-0.688961,0.141583,-1.417122,-0.660654


In [0]:
#::GMG::Here the rows are sorted by the dates, which are the row index
df.sort_index(axis=0, ascending=False)

Unnamed: 0,A,B,C,D
2013-01-06,-0.660654,-1.417122,0.141583,-0.688961
2013-01-05,-0.578888,0.266593,-0.075044,1.141016
2013-01-04,-1.129952,-0.378938,1.561718,0.855326
2013-01-03,-0.248123,0.176343,1.000584,-0.80327
2013-01-02,-0.047375,0.084995,-0.631822,0.24423
2013-01-01,-0.326199,-0.344602,-0.695989,1.083588
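A minimal sketch of the two axes, using a small frame with deliberately unsorted labels (all names here are illustrative): `axis=0` reorders rows by their index labels, `axis=1` reorders columns by their names, and in both cases the values move with their labels.

```python
import pandas as pd

frame = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["b", "a"],
    columns=["y", "z", "x"],
)

by_rows = frame.sort_index(axis=0)                   # rows ordered by label: a, b
by_cols = frame.sort_index(axis=1, ascending=False)  # columns ordered z, y, x
print(by_rows)
print(by_cols)
```

Neither call modifies `frame`; both return a sorted copy.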


Sorting by values:

In [0]:
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2013-01-06,-0.660654,-1.417122,0.141583,-0.688961
2013-01-04,-1.129952,-0.378938,1.561718,0.855326
2013-01-01,-0.326199,-0.344602,-0.695989,1.083588
2013-01-02,-0.047375,0.084995,-0.631822,0.24423
2013-01-03,-0.248123,0.176343,1.000584,-0.80327
2013-01-05,-0.578888,0.266593,-0.075044,1.141016


In [0]:
#::GMG::The inline help (<TAB>) has an interesting example with NaNs
df_ex = pd.DataFrame({
          'col1' : ['A', 'A', 'B', np.nan, 'D', 'C'],
          'col2' : [2, 1, 9, 8, 7, 4],
          'col3': [0, 1, 9, 4, 2, 3],
          })
df_ex

Unnamed: 0,col1,col2,col3
0,A,2,0
1,A,1,1
2,B,9,9
3,,8,4
4,D,7,2
5,C,4,3


In [0]:
#::GMG::Sort by col1, NaN comes last
df_ex.sort_values(by=['col1'])

Unnamed: 0,col1,col2,col3
0,A,2,0
1,A,1,1
2,B,9,9
5,C,4,3
4,D,7,2
3,,8,4
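Where the missing values land is controlled by the `na_position` parameter of `sort_values`, which defaults to `"last"`. A short sketch on made-up data:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"key": ["b", np.nan, "a"], "val": [1, 2, 3]})

nan_last = frame.sort_values(by="key")                        # default: NaN at the end
nan_first = frame.sort_values(by="key", na_position="first")  # NaN at the start
print(nan_last)
print(nan_first)
```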


#### Selection

**Note:** While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, `.at`, `.iat`, `.loc` and `.iloc`.

See the indexing documentation [Indexing and Selecting Data](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing) and [MultiIndex / Advanced Indexing](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced).

**Getting**

Selecting a single column, which yields a Series, equivalent to `df.A`:

In [0]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.326199,-0.344602,-0.695989,1.083588
2013-01-02,-0.047375,0.084995,-0.631822,0.24423
2013-01-03,-0.248123,0.176343,1.000584,-0.80327
2013-01-04,-1.129952,-0.378938,1.561718,0.855326
2013-01-05,-0.578888,0.266593,-0.075044,1.141016
2013-01-06,-0.660654,-1.417122,0.141583,-0.688961


In [0]:
display(df['A'],df.A)
print('df:',type(df), '\ndf["A"]', type (df['A']))

2013-01-01   -0.326199
2013-01-02   -0.047375
2013-01-03   -0.248123
2013-01-04   -1.129952
2013-01-05   -0.578888
2013-01-06   -0.660654
Freq: D, Name: A, dtype: float64

2013-01-01   -0.326199
2013-01-02   -0.047375
2013-01-03   -0.248123
2013-01-04   -1.129952
2013-01-05   -0.578888
2013-01-06   -0.660654
Freq: D, Name: A, dtype: float64

df: <class 'pandas.core.frame.DataFrame'> 
df["A"] <class 'pandas.core.series.Series'>


Selecting via `[]`, which slices the rows.

In [0]:
#::GMG::with date labels both endpoints are INCLUDED (!)
#       integer positions start at zero and the end position is excluded
display(df[0:3],df['20130102':'20130104'])

Unnamed: 0,A,B,C,D
2013-01-01,-0.326199,-0.344602,-0.695989,1.083588
2013-01-02,-0.047375,0.084995,-0.631822,0.24423
2013-01-03,-0.248123,0.176343,1.000584,-0.80327


Unnamed: 0,A,B,C,D
2013-01-02,-0.047375,0.084995,-0.631822,0.24423
2013-01-03,-0.248123,0.176343,1.000584,-0.80327
2013-01-04,-1.129952,-0.378938,1.561718,0.855326
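The difference between the two slice styles is easier to see when the values are just the row positions. This is a sketch under that assumption, with illustrative names:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2013-01-01", periods=6)
frame = pd.DataFrame({"A": np.arange(6)}, index=idx)

# Integer slice: positional and half-open, like a Python list.
by_position = frame[0:3]

# Label slice on the DatetimeIndex: both endpoints are included.
by_label = frame["2013-01-02":"2013-01-04"]
print(by_position)
print(by_label)
```

Both slices return three rows, but for different reasons: `0:3` stops before position 3, while the date slice includes `2013-01-04`.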


**Selection by Label**

See more in [Selection by Label](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-label).

For getting a cross section using a label:

In [0]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.326199,-0.344602,-0.695989,1.083588
2013-01-02,-0.047375,0.084995,-0.631822,0.24423
2013-01-03,-0.248123,0.176343,1.000584,-0.80327
2013-01-04,-1.129952,-0.378938,1.561718,0.855326
2013-01-05,-0.578888,0.266593,-0.075044,1.141016
2013-01-06,-0.660654,-1.417122,0.141583,-0.688961


In [0]:
#::GMG::First row: .loc with a single label returns a Series, while the slice df[:1] returns a DataFrame (heed the types!)
display(df.loc[dates[0]],df[:1])
display(type(df.loc[dates[0]]),type(df[:1]))

A   -0.326199
B   -0.344602
C   -0.695989
D    1.083588
Name: 2013-01-01 00:00:00, dtype: float64

Unnamed: 0,A,B,C,D
2013-01-01,-0.326199,-0.344602,-0.695989,1.083588


pandas.core.series.Series

pandas.core.frame.DataFrame

Selecting on a multi-axis by label:

In [0]:
df.loc[:,['A','B']]

Unnamed: 0,A,B
2013-01-01,-0.326199,-0.344602
2013-01-02,-0.047375,0.084995
2013-01-03,-0.248123,0.176343
2013-01-04,-1.129952,-0.378938
2013-01-05,-0.578888,0.266593
2013-01-06,-0.660654,-1.417122


Showing label slicing, both endpoints are *included*:

In [0]:
display(df.loc['20130102':'20130104',['A','B']],
        type(df.loc['20130102':'20130104',['A','B']]))

Unnamed: 0,A,B
2013-01-02,-0.047375,0.084995
2013-01-03,-0.248123,0.176343
2013-01-04,-1.129952,-0.378938


pandas.core.frame.DataFrame

Reduction in the dimensions of the returned object:

In [0]:
display(df.loc['20130102',['A','B']],type(df.loc['20130102',['A','B']]))

A   -0.047375
B    0.084995
Name: 2013-01-02 00:00:00, dtype: float64

pandas.core.series.Series

For getting a scalar value:

In [0]:
display(df.loc[dates[0],'A'],type(df.loc[dates[0],'A']))

-0.32619947842139757

numpy.float64

For getting fast access to a scalar (equivalent to the prior method):

In [0]:
display(df.at[dates[0],'A'], type(df.at[dates[0],'A']))

-0.32619947842139757

numpy.float64
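To summarise the label-based cases above, the following sketch (on a made-up two-row frame) shows how the type of the result depends on whether each axis is selected with a slice or list, or with a single label:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]}, index=["r1", "r2"])

sub_frame = frame.loc["r1":"r2", ["A"]]  # slice + list of labels -> DataFrame
row = frame.loc["r1", ["A", "B"]]        # single label + list    -> Series
scalar_loc = frame.loc["r1", "A"]        # two single labels      -> scalar
scalar_at = frame.at["r1", "A"]          # same scalar through the .at accessor
print(type(sub_frame).__name__, type(row).__name__, scalar_loc, scalar_at)
```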

**Selection by Position**

See more in [Selection by Position](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer).

Select via the position of the passed integers:

In [5]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-1.06131,0.713476,0.244359,1.169532
2013-01-02,1.679106,0.677066,-1.083542,-0.3965
2013-01-03,0.277496,-0.297332,-0.394482,0.22891
2013-01-04,-0.055822,-2.341315,-0.102868,-0.993303
2013-01-05,1.733123,-0.567087,1.950235,-0.662319
2013-01-06,1.644846,0.810954,0.235825,0.156586


In [8]:
display(df.iloc[3],type(df.iloc[3]))

A   -0.055822
B   -2.341315
C   -0.102868
D   -0.993303
Name: 2013-01-04 00:00:00, dtype: float64

pandas.core.series.Series

By integer slices, acting similarly to NumPy/Python:

In [11]:
display(df.iloc[3:5,0:2],type(df.iloc[3:5,0:2]))

Unnamed: 0,A,B
2013-01-04,-0.055822,-2.341315
2013-01-05,1.733123,-0.567087


pandas.core.frame.DataFrame

By lists of integer position locations, similar to the numpy/python style:

In [12]:
df.iloc[[1,2,4],[0,2]]

Unnamed: 0,A,C
2013-01-02,1.679106,-1.083542
2013-01-03,0.277496,-0.394482
2013-01-05,1.733123,1.950235


For slicing rows explicitly:

In [13]:
df.iloc[1:3,:]

Unnamed: 0,A,B,C,D
2013-01-02,1.679106,0.677066,-1.083542,-0.3965
2013-01-03,0.277496,-0.297332,-0.394482,0.22891


For slicing columns explicitly:

In [14]:
df.iloc[:,1:3]

Unnamed: 0,B,C
2013-01-01,0.713476,0.244359
2013-01-02,0.677066,-1.083542
2013-01-03,-0.297332,-0.394482
2013-01-04,-2.341315,-0.102868
2013-01-05,-0.567087,1.950235
2013-01-06,0.810954,0.235825


For getting a value explicitly:

In [15]:
df.iloc[1,1]

0.6770657094133041

For getting fast access to a scalar (equivalent to the prior method):

In [16]:
df.iat[1,1]

0.6770657094133041

**Boolean Indexing**

Using a single column’s values to select data.

In [17]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-1.06131,0.713476,0.244359,1.169532
2013-01-02,1.679106,0.677066,-1.083542,-0.3965
2013-01-03,0.277496,-0.297332,-0.394482,0.22891
2013-01-04,-0.055822,-2.341315,-0.102868,-0.993303
2013-01-05,1.733123,-0.567087,1.950235,-0.662319
2013-01-06,1.644846,0.810954,0.235825,0.156586


In [18]:
df[df.A > 0]

Unnamed: 0,A,B,C,D
2013-01-02,1.679106,0.677066,-1.083542,-0.3965
2013-01-03,0.277496,-0.297332,-0.394482,0.22891
2013-01-05,1.733123,-0.567087,1.950235,-0.662319
2013-01-06,1.644846,0.810954,0.235825,0.156586


Selecting values from a DataFrame where a boolean condition is met.

In [19]:
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,,0.713476,0.244359,1.169532
2013-01-02,1.679106,0.677066,,
2013-01-03,0.277496,,,0.22891
2013-01-04,,,,
2013-01-05,1.733123,,1.950235,
2013-01-06,1.644846,0.810954,0.235825,0.156586
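The two forms of boolean indexing behave differently, which a small deterministic frame makes visible. In this sketch (names are illustrative) a boolean Series filters whole rows, while a boolean DataFrame keeps the shape and puts `NaN` where the condition fails:

```python
import pandas as pd

frame = pd.DataFrame({"A": [-1.0, 2.0, 3.0], "B": [4.0, -5.0, 6.0]})

rows_kept = frame[frame["A"] > 0]  # Series mask: drops the row where A <= 0
masked = frame[frame > 0]          # DataFrame mask: same shape, NaN where False
print(rows_kept)
print(masked)
```

The second form gives the same result as `frame.where(frame > 0)`.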


Using the `isin()` method for filtering:

In [20]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-1.06131,0.713476,0.244359,1.169532
2013-01-02,1.679106,0.677066,-1.083542,-0.3965
2013-01-03,0.277496,-0.297332,-0.394482,0.22891
2013-01-04,-0.055822,-2.341315,-0.102868,-0.993303
2013-01-05,1.733123,-0.567087,1.950235,-0.662319
2013-01-06,1.644846,0.810954,0.235825,0.156586


In [21]:
df2 = df.copy()
df2

Unnamed: 0,A,B,C,D
2013-01-01,-1.06131,0.713476,0.244359,1.169532
2013-01-02,1.679106,0.677066,-1.083542,-0.3965
2013-01-03,0.277496,-0.297332,-0.394482,0.22891
2013-01-04,-0.055822,-2.341315,-0.102868,-0.993303
2013-01-05,1.733123,-0.567087,1.950235,-0.662319
2013-01-06,1.644846,0.810954,0.235825,0.156586


In [22]:
df2['E'] = ['one', 'one','two','three','four','three']
df2

Unnamed: 0,A,B,C,D,E
2013-01-01,-1.06131,0.713476,0.244359,1.169532,one
2013-01-02,1.679106,0.677066,-1.083542,-0.3965,one
2013-01-03,0.277496,-0.297332,-0.394482,0.22891,two
2013-01-04,-0.055822,-2.341315,-0.102868,-0.993303,three
2013-01-05,1.733123,-0.567087,1.950235,-0.662319,four
2013-01-06,1.644846,0.810954,0.235825,0.156586,three


In [23]:
df2[df2['E'].isin(['two','four'])]

Unnamed: 0,A,B,C,D,E
2013-01-03,0.277496,-0.297332,-0.394482,0.22891,two
2013-01-05,1.733123,-0.567087,1.950235,-0.662319,four


**Setting**

Setting a new column automatically aligns the data by the indexes.

In [24]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [25]:
df['F'] = s1
df

Unnamed: 0,A,B,C,D,F
2013-01-01,-1.06131,0.713476,0.244359,1.169532,
2013-01-02,1.679106,0.677066,-1.083542,-0.3965,1.0
2013-01-03,0.277496,-0.297332,-0.394482,0.22891,2.0
2013-01-04,-0.055822,-2.341315,-0.102868,-0.993303,3.0
2013-01-05,1.733123,-0.567087,1.950235,-0.662319,4.0
2013-01-06,1.644846,0.810954,0.235825,0.156586,5.0
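The alignment can be reproduced on a smaller scale. In this sketch (made-up labels) the Series shares only some labels with the frame: matching labels are copied, labels missing from the Series become `NaN`, and labels that exist only in the Series are dropped.

```python
import pandas as pd

frame = pd.DataFrame({"A": [10, 20, 30]}, index=["a", "b", "c"])
extra = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Assignment aligns on the index: "a" has no match, "d" is ignored.
frame["F"] = extra
print(frame)
```

This also explains why column `F` above shows `1.0`, `2.0`, ... instead of integers: the `NaN` in the first row forces a float dtype.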


Setting values by label:

In [26]:
df.at[dates[0],'A'] = 0
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.713476,0.244359,1.169532,
2013-01-02,1.679106,0.677066,-1.083542,-0.3965,1.0
2013-01-03,0.277496,-0.297332,-0.394482,0.22891,2.0
2013-01-04,-0.055822,-2.341315,-0.102868,-0.993303,3.0
2013-01-05,1.733123,-0.567087,1.950235,-0.662319,4.0
2013-01-06,1.644846,0.810954,0.235825,0.156586,5.0


Setting values by position:

In [27]:
df.iat[0,1] = 0
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,0.244359,1.169532,
2013-01-02,1.679106,0.677066,-1.083542,-0.3965,1.0
2013-01-03,0.277496,-0.297332,-0.394482,0.22891,2.0
2013-01-04,-0.055822,-2.341315,-0.102868,-0.993303,3.0
2013-01-05,1.733123,-0.567087,1.950235,-0.662319,4.0
2013-01-06,1.644846,0.810954,0.235825,0.156586,5.0


Setting by assigning with a NumPy array:

In [28]:
df.loc[:,'D'] = np.array([5] * len(df))
display(len(df), np.array([5] * len(df)))
df

6

array([5, 5, 5, 5, 5, 5])

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,0.244359,5,
2013-01-02,1.679106,0.677066,-1.083542,5,1.0
2013-01-03,0.277496,-0.297332,-0.394482,5,2.0
2013-01-04,-0.055822,-2.341315,-0.102868,5,3.0
2013-01-05,1.733123,-0.567087,1.950235,5,4.0
2013-01-06,1.644846,0.810954,0.235825,5,5.0


A `where` operation with setting.

In [29]:
df2 = df.copy()
df2[df2 > 0] = -df2
df2

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-0.244359,-5,
2013-01-02,-1.679106,-0.677066,-1.083542,-5,-1.0
2013-01-03,-0.277496,-0.297332,-0.394482,-5,-2.0
2013-01-04,-0.055822,-2.341315,-0.102868,-5,-3.0
2013-01-05,-1.733123,-0.567087,-1.950235,-5,-4.0
2013-01-06,-1.644846,-0.810954,-0.235825,-5,-5.0
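The masked assignment only touches the cells where the condition holds. A minimal sketch with hand-picked values (illustrative names) shows that positive entries are negated and everything else is left alone:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

flipped = frame.copy()
# Only cells where the mask is True take the value from the right-hand side.
flipped[flipped > 0] = -flipped
print(flipped)
```

For this numeric frame the result is the same as `-frame.abs()`, and the original `frame` is unchanged because the assignment was done on a copy.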


#### Missing Data

pandas primarily uses the value `np.nan` to represent missing data. It is by default not included in computations. See the [Missing Data section](http://pandas.pydata.org/pandas-docs/stable/missing_data.html#missing-data).

Reindexing allows you to *change/add/delete* the index on a specified axis. This returns a copy of the data.

In [30]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,0.244359,5,
2013-01-02,1.679106,0.677066,-1.083542,5,1.0
2013-01-03,0.277496,-0.297332,-0.394482,5,2.0
2013-01-04,-0.055822,-2.341315,-0.102868,5,3.0
2013-01-05,1.733123,-0.567087,1.950235,5,4.0
2013-01-06,1.644846,0.810954,0.235825,5,5.0


In [0]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1

In [34]:
display('df:',id(df),'df1:',id(df1))
df1

'df:'

140470032621184

'df1:'

140470024982480

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,0.0,0.244359,5,,1.0
2013-01-02,1.679106,0.677066,-1.083542,5,1.0,1.0
2013-01-03,0.277496,-0.297332,-0.394482,5,2.0,
2013-01-04,-0.055822,-2.341315,-0.102868,5,3.0,


To drop any rows that have missing data.

In [41]:
display(df1.dropna(how='any'), type(df1.dropna(how='any')))

Unnamed: 0,A,B,C,D,F,E
2013-01-02,1.679106,0.677066,-1.083542,5,1.0,1.0


pandas.core.frame.DataFrame

Filling missing data.

In [37]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,0.0,0.244359,5,5.0,1.0
2013-01-02,1.679106,0.677066,-1.083542,5,1.0,1.0
2013-01-03,0.277496,-0.297332,-0.394482,5,2.0,5.0
2013-01-04,-0.055822,-2.341315,-0.102868,5,3.0,5.0


To get the boolean mask where values are `nan`.

In [38]:
pd.isna(df1)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,False,False,False,False,True,False
2013-01-02,False,False,False,False,False,False
2013-01-03,False,False,False,False,False,True
2013-01-04,False,False,False,False,False,True
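The three missing-data helpers can be compared side by side on a small frame. This is a sketch with made-up values; note that `dropna` and `fillna` return new objects and do not change the original.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

complete_rows = frame.dropna(how="any")  # keep only rows without any NaN
filled = frame.fillna(value=0)           # replace every NaN with 0
mask = frame.isna()                      # True where a value is missing
print(complete_rows)
print(filled)
print(mask)
```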


#### Operations

See the [Basic section on Binary Ops](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-binop).

**Stats**

Operations in general *exclude* missing data.

Performing a descriptive statistic:

In [39]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,0.244359,5,
2013-01-02,1.679106,0.677066,-1.083542,5,1.0
2013-01-03,0.277496,-0.297332,-0.394482,5,2.0
2013-01-04,-0.055822,-2.341315,-0.102868,5,3.0
2013-01-05,1.733123,-0.567087,1.950235,5,4.0
2013-01-06,1.644846,0.810954,0.235825,5,5.0


In [40]:
display(df.mean(), type(df.mean()))

A    0.879792
B   -0.286286
C    0.141588
D    5.000000
F    3.000000
dtype: float64

pandas.core.series.Series

Same operation on the other axis:

In [42]:
display(df.mean(1), type(df.mean(1)))

2013-01-01    1.311090
2013-01-02    1.454526
2013-01-03    1.317136
2013-01-04    1.099999
2013-01-05    2.423254
2013-01-06    2.538325
Freq: D, dtype: float64

pandas.core.series.Series
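To make the "missing data is excluded" rule concrete, here is a small sketch (illustrative values) that computes the mean along both axes and then repeats the column mean with `skipna=False`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, np.nan], "B": [4.0, 6.0, 8.0]})

col_means = frame.mean()              # per column; the NaN in A is skipped
row_means = frame.mean(axis=1)        # per row; the last row averages only B
strict = frame.mean(skipna=False)     # NaN propagates into the result for A
print(col_means)
print(row_means)
print(strict)
```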

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.

In [43]:
#::GMG::shift (?)
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.shift.html
# Err ... sorry, but I don't get it :(
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
s

2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
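On the `shift` question: with the default arguments, `shift(n)` keeps the index as it is and moves the values `n` positions down, filling the vacated positions at the top with `NaN` and discarding the values pushed past the end. A minimal sketch with a shift of one (illustrative names):

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 5.0], index=pd.date_range("2013-01-01", periods=3))

# Values move one position down; the dates stay where they are.
shifted = s.shift(1)
print(shifted)
```

In the cell above, `shift(2)` is why the first two dates are `NaN` and the values `1, 3, 5` start on `2013-01-03`. The `NaN` on `2013-01-06` is the original missing value, moved down from `2013-01-04`.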

In [44]:
s_1 = pd.Series([1,3,5,np.nan,6,8], index=dates)
s_1

2013-01-01    1.0
2013-01-02    3.0
2013-01-03    5.0
2013-01-04    NaN
2013-01-05    6.0
2013-01-06    8.0
Freq: D, dtype: float64

In [45]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,0.244359,5,
2013-01-02,1.679106,0.677066,-1.083542,5,1.0
2013-01-03,0.277496,-0.297332,-0.394482,5,2.0
2013-01-04,-0.055822,-2.341315,-0.102868,5,3.0
2013-01-05,1.733123,-0.567087,1.950235,5,4.0
2013-01-06,1.644846,0.810954,0.235825,5,5.0


In [46]:
df.sub(s, axis = 'index')

Unnamed: 0,A,B,C,D,F
2013-01-01,,,,,
2013-01-02,,,,,
2013-01-03,-0.722504,-1.297332,-1.394482,4.0,1.0
2013-01-04,-3.055822,-5.341315,-3.102868,2.0,0.0
2013-01-05,-3.266877,-5.567087,-3.049765,0.0,-1.0
2013-01-06,,,,,


In [47]:
df.sub(s_1, axis = 'index')

Unnamed: 0,A,B,C,D,F
2013-01-01,-1.0,-1.0,-0.755641,4.0,
2013-01-02,-1.320894,-2.322934,-4.083542,2.0,-2.0
2013-01-03,-4.722504,-5.297332,-5.394482,0.0,-3.0
2013-01-04,,,,,
2013-01-05,-4.266877,-6.567087,-4.049765,-1.0,-2.0
2013-01-06,-6.355154,-7.189046,-7.764175,-3.0,-3.0
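With `axis='index'`, the Series is matched against the row labels and its value for each row is subtracted from every column of that row. A `NaN` in the Series therefore blanks out the whole row, as in this sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [10.0, 20.0], "B": [30.0, 40.0]}, index=["r1", "r2"])
offsets = pd.Series([1.0, np.nan], index=["r1", "r2"])

# Row r1: subtract 1.0 from both columns. Row r2: NaN propagates.
result = frame.sub(offsets, axis="index")
print(result)
```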


**Apply**

Applying functions to the data:

::GMG::**note** just curious :)

 - [Google: pandas apply versus R apply](https://www.google.com/search?q=pandas+apply+versus+R+apply&ie=utf-8&oe=utf-8&client=firefox-b-ab)
 - [pandas.DataFrame.apply](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html)
 - [R documentation: Apply](https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/apply)
 - [Stackoverflow: Python vs. R: apply a function to each element in a vector](https://stackoverflow.com/questions/41170762/python-vs-r-apply-a-function-to-each-element-in-a-vector)
 - [Apply Function in R – apply vs lapply vs sapply vs mapply vs tapply vs rapply vs vapply](http://www.datasciencemadesimple.com/apply-function-r/)

In [48]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,0.244359,5,
2013-01-02,1.679106,0.677066,-1.083542,5,1.0
2013-01-03,0.277496,-0.297332,-0.394482,5,2.0
2013-01-04,-0.055822,-2.341315,-0.102868,5,3.0
2013-01-05,1.733123,-0.567087,1.950235,5,4.0
2013-01-06,1.644846,0.810954,0.235825,5,5.0


In [49]:
#::GMG::np.cumsum (?)
# https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.cumsum.html
# https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ndarray.cumsum.html
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,0.244359,5,
2013-01-02,1.679106,0.677066,-0.839183,10,1.0
2013-01-03,1.956602,0.379734,-1.233665,15,3.0
2013-01-04,1.90078,-1.961581,-1.336532,20,6.0
2013-01-05,3.633903,-2.528668,0.613703,25,10.0
2013-01-06,5.278749,-1.717714,0.849529,30,15.0


In [50]:
#::GMG::lambda (?)
# https://www.google.com/search?q=python+lambda&ie=utf-8&oe=utf-8&client=firefox-b-ab
df.apply(lambda x: x.max() - x.min())

A    1.788945
B    3.152269
C    3.033777
D    0.000000
F    4.000000
dtype: float64
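As a sketch of what `apply` does in the two cells above: by default the function receives each column as a Series. A function that returns a Series (like `np.cumsum`) yields a DataFrame, and one that returns a scalar yields a Series. Passing `axis=1` hands over rows instead. The data and names below are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [10, 0, 5]})

running = frame.apply(np.cumsum)                                     # column-wise running totals
spread = frame.apply(lambda col: col.max() - col.min())              # one number per column
row_spread = frame.apply(lambda row: row.max() - row.min(), axis=1)  # one number per row
print(running)
print(spread)
print(row_spread)
```

A `lambda` is just an anonymous one-expression function, so `lambda col: col.max() - col.min()` is equivalent to a named function that returns the same expression.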

**Histogramming**

See more at [Histogramming and Discretization](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-discretization).

In [51]:
s = pd.Series(np.random.randint(0, 7, size=10))
display(s,'Counts',s.value_counts())

0    5
1    6
2    2
3    0
4    6
5    4
6    2
7    2
8    1
9    0
dtype: int64

'Counts'

2    3
6    2
0    2
5    1
4    1
1    1
dtype: int64
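Because the cell above uses random integers, its output changes on every run. A fixed-input sketch (values chosen by hand) makes the behaviour of `value_counts` easier to check: the distinct values become the index, and the counts are sorted in descending order.

```python
import pandas as pd

s = pd.Series([2, 6, 2, 0, 6, 2])

# Index = distinct values, data = number of occurrences, most frequent first.
counts = s.value_counts()
print(counts)
```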

**String Methods**

Series is equipped with a set of string processing methods in the `str` attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in `str` generally uses [regular expressions](https://docs.python.org/3/library/re.html) by default (and in some cases always uses them). See more at [Vectorized String Methods](http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods).

In [52]:
s = pd.Series(
    ['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat']
    )
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
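To illustrate the remark about regular expressions, this sketch uses `str.contains` with an anchored pattern. The pattern, the `case=False` flag and the `na=False` fill value are choices made for this example, not part of the original tutorial cell.

```python
import numpy as np
import pandas as pd

s = pd.Series(["Aaba", np.nan, "dog"])

lowered = s.str.lower()  # element-wise; the missing value stays missing

# "^a" is a regular expression: match strings that start with "a".
# case=False ignores case, na=False reports missing values as False.
starts_with_a = s.str.contains("^a", case=False, na=False)
print(lowered)
print(starts_with_a)
```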

#### Merge

**Concat**

pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

See the [Merging section](http://pandas.pydata.org/pandas-docs/stable/merging.html#merging).

Concatenating pandas objects together with [`concat()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html#pandas.concat):