# INTRODUCTION TO PANDAS (library)

## Pandas is an open source library built on top of NumPy. It lets us analyze, clean, and prepare data easily, and can be thought of as a more powerful version of Excel. It also has its own visualization features.

_pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Pandas is free software released under the three-clause BSD license.[2] The name is derived from the term "panel data", an econometrics term for multidimensional structured data sets._

Library features
- DataFrame object for data manipulation with integrated indexing.
- Tools for reading and writing data between in-memory data structures and different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, fancy indexing,[3] and subsetting of large data sets.
- Data structure column insertion and deletion.
- Group by engine allowing split-apply-combine operations on data sets.
- Data set merging and joining.
- Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
- Time series functionality: date range generation[4] and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging.
- The library is highly optimized for performance, with critical code paths written in Cython or C.
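
Two of the features listed above, the group-by engine and merging, can be sketched briefly. This is only a minimal illustration: the `sales` and `regions` tables, their column names, and their values are hypothetical and invented for this example.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration
sales = pd.DataFrame({
    "store": ["north", "north", "south", "south"],
    "amount": [10, 20, 5, 15],
})
regions = pd.DataFrame({
    "store": ["north", "south"],
    "manager": ["Ana", "Luis"],
})

# Split-apply-combine: split rows by store, sum each group, combine the results
totals = sales.groupby("store")["amount"].sum()

# Merge (join) the per-store totals with the second table on the shared key
report = regions.merge(totals.reset_index(), on="store")
print(report)
```

The result has one row per store, carrying both the manager and the summed amount.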

In [6]:
# We can install pandas with "conda" or "pip"
# - $ conda install pandas
# - $ pip install pandas
# - $ pip3 install pandas

# Pandas also covers merging and permutation, Series, DataFrames, concatenation, and data I/O operations


import pandas

# Display the pandas logo inside the notebook
from IPython.display import Image
Image(url="https://upload.wikimedia.org/wikipedia/commons/4/45/Pandas_logo.png")



In [22]:
import pandas as pd

# SERIES

import numpy as np # used to create the array
etiquetas = ['A','B','C']
my_data = [10,20,30]
arr = np.array(my_data)

arr

array([10, 20, 30])

In [23]:
etiquetas


['A', 'B', 'C']

In [24]:
d = {'a':10, 'b':20, 'c':30}

In [25]:
d

{'a': 10, 'b': 20, 'c': 30}

In [26]:
pd.Series(data=my_data) # press SHIFT-TAB to see the help
# Argument order matters: the first positional argument is data (here my_data), the second is index

0    10
1    20
2    30
dtype: int64

In [27]:
pd.Series(data=my_data, index=etiquetas)

A    10
B    20
C    30
dtype: int64

In [28]:
pd.Series(etiquetas,my_data)

10    A
20    B
30    C
dtype: object
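
The `arr` and `d` objects defined earlier can be passed to the same constructor. A minimal sketch, redefining those values so the snippet runs on its own (the names `s_arr` and `s_dict` are introduced here for illustration): with a dictionary, the keys become the index.

```python
import numpy as np
import pandas as pd

etiquetas = ['A', 'B', 'C']
arr = np.array([10, 20, 30])
d = {'a': 10, 'b': 20, 'c': 30}

# From a NumPy array, with explicit labels
s_arr = pd.Series(arr, index=etiquetas)

# From a dictionary: keys become the index, values become the data
s_dict = pd.Series(d)

print(s_arr['B'])    # label-based lookup
print(s_dict['c'])
```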

In [29]:
# Fun fact: a Series can hold arbitrary Python objects, including functions.

In [37]:
pd.Series(data=[sum,len,str,min,max])

0    <built-in function sum>
1    <built-in function len>
2              <class 'str'>
3    <built-in function min>
4    <built-in function max>
dtype: object

In [43]:
ser1 = pd.Series([1,2,3,4],['USA','MEXICO','ALEMANIA','JAPON'])
ser2 = pd.Series([1,2,3,4],['USA','MEXICO','ITALIA','JAPON'])
ser3 = pd.Series(data=etiquetas)
ser1

USA         1
MEXICO      2
ALEMANIA    3
JAPON       4
dtype: int64

In [44]:
ser2

USA       1
MEXICO    2
ITALIA    3
JAPON     4
dtype: int64

In [45]:
ser3

0    A
1    B
2    C
dtype: object

In [47]:
ser3[0] # access an element by its index label

'A'
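
`ser1` and `ser2` share three labels and differ in one (ALEMANIA vs. ITALIA), which makes them a convenient way to see the automatic data alignment mentioned in the feature list. The following is a small sketch of what happens when they are added; the name `total` is introduced here for illustration.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], ['USA', 'MEXICO', 'ALEMANIA', 'JAPON'])
ser2 = pd.Series([1, 2, 3, 4], ['USA', 'MEXICO', 'ITALIA', 'JAPON'])

# Values are matched by label, not by position.
# Labels present in only one Series produce NaN in the result.
total = ser1 + ser2
print(total)

# Because NaN is a float, the result is float64 even though both inputs are int64
print(total.dtype)
```

If missing labels should count as zero instead, `ser1.add(ser2, fill_value=0)` is one option.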

# DATAFRAME _the main tool in pandas_

In [50]:
import numpy as np
import pandas as pd
from numpy.random import randn

# Create the DataFrame

In [58]:
np.random.seed(101) # seed the pseudo-random generator so the same sequence is produced on every run

df = pd.DataFrame(randn(5,4), ['A','B','C','D','E'],['W','X','Y','Z']) # data, row labels (index), column labels (the headers are the 3rd parameter)

In [59]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [60]:
df['W'] # access column W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [61]:
type(df['W'])  # each column is a Series; a DataFrame is a collection of Series sharing one index

pandas.core.series.Series

In [63]:
# To select two columns, pass a list of column names
df[['W','X']]

          W         X
A  2.706850  0.628133
B  0.651118 -0.319318
C -2.018168  0.740122
D  0.188695 -0.758872
E  0.190794  1.978757


In [65]:
# Update the DataFrame
df['new']= df['W'] + df['X'] # add the two columns and store the result in a new column

In [66]:
df


          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.334983
B  0.651118 -0.319318 -0.848077  0.605965  0.331800
C -2.018168  0.740122  0.528813 -0.589001 -1.278046
D  0.188695 -0.758872 -0.933237  0.955057 -0.570177
E  0.190794  1.978757  2.605967  0.683509  2.169552


In [72]:
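# Drop the column that was just added
# axis=0 refers to rows and axis=1 to columns, matching the (rows, columns) order of df.shape, e.g. (5, 6)
df.drop('new', axis=1, inplace=True)
# inplace=True applies the change to df itself; without it, drop() returns a new DataFrame and df stays unchanged.
# Re-running this cell once 'new' is gone raises the error shown below.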

ValueError: labels ['new'] not contained in axis

In [73]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
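
The behaviour of `drop` with and without `inplace=True` can be checked with a short sketch. It rebuilds the DataFrame with the same seed so it runs on its own; the names `without_new`, `without_row_e`, and `still_fine` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
df = pd.DataFrame(np.random.randn(5, 4), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])
df['new'] = df['W'] + df['X']

# Without inplace=True, drop() returns a modified copy and leaves df untouched
without_new = df.drop('new', axis=1)
print(df.shape)           # (5, 5): df still has the 'new' column
print(without_new.shape)  # (5, 4)

# axis=0 (the default) drops rows by index label instead of columns
without_row_e = df.drop('E', axis=0)
print(without_row_e.shape)  # (4, 5)

# errors='ignore' skips labels that do not exist instead of raising
still_fine = without_new.drop('new', axis=1, errors='ignore')
print(still_fine.shape)   # (5, 4)
```

Assigning the result (`df = df.drop(...)`) is a common alternative to `inplace=True`.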


In [74]:
df.shape


(5, 4)

In [76]:
df.loc['A'] # fetch the entire row labeled 'A'

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

In [83]:
# Select a row by integer position
df.iloc[2]

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

In [92]:
df.loc[['A','B'],['W','Y']].count() # the trailing call can be any aggregation method, e.g. count, max, min

W    2
Y    2
dtype: int64

In [94]:
df.loc[['A','B'],['W','Y']].pow(10)

              W         Y
A  21117.458169  0.380814
B      0.013696  0.192465
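
`loc` selects by label and `iloc` by integer position, so the same subset can be written either way. A minimal sketch, rebuilding the DataFrame with the same seed so it runs on its own (the variable names below are introduced for illustration):

```python
import numpy as np
import pandas as pd

np.random.seed(101)
df = pd.DataFrame(np.random.randn(5, 4), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])

# A single cell: row 'B' is position 1, column 'Y' is position 2
by_label = df.loc['B', 'Y']
by_position = df.iloc[1, 2]
print(by_label == by_position)  # True

# The same 2x2 subset by labels and by positions
subset_loc = df.loc[['A', 'B'], ['W', 'Y']]
subset_iloc = df.iloc[0:2, [0, 2]]
print(subset_loc.equals(subset_iloc))  # True

# Label slices include the end label; position slices exclude the end position
rows_by_label = df.loc['A':'C']   # rows A, B, C
rows_by_position = df.iloc[0:2]   # rows A, B
```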


In [108]:
# CONDITIONAL SELECTION
# Conditional selection using bracket notation

In [102]:
df > 0 # True wherever the value is greater than zero

       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False
D   True  False  False   True
E   True   True   True   True


In [106]:
booldf = df > 0 # store the boolean mask in a variable
booldf

       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False
D   True  False  False   True
E   True   True   True   True


In [107]:
df[booldf] # filter the data by passing the mask variable; values where the mask is False become NaN
# Explicit is better than implicit

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118       NaN       NaN  0.605965
C       NaN  0.740122  0.528813       NaN
D  0.188695       NaN       NaN  0.955057
E  0.190794  1.978757  2.605967  0.683509
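
Indexing with a boolean DataFrame gives the same result as `DataFrame.where`, which may read more explicitly. The sketch below uses values copied from the `df` output earlier in this notebook so that it runs on its own; `masked`, `zero_filled`, and `complete_rows` are names introduced for illustration.

```python
import pandas as pd

# Values copied from the df output shown earlier in this notebook
df = pd.DataFrame(
    [[ 2.706850,  0.628133,  0.907969,  0.503826],
     [ 0.651118, -0.319318, -0.848077,  0.605965],
     [-2.018168,  0.740122,  0.528813, -0.589001],
     [ 0.188695, -0.758872, -0.933237,  0.955057],
     [ 0.190794,  1.978757,  2.605967,  0.683509]],
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)

# Keep values where the condition holds, NaN elsewhere
masked = df.where(df > 0)
print(masked.equals(df[df > 0]))  # True

# Two common follow-ups for the resulting NaN values
zero_filled = masked.fillna(0)      # replace NaN with 0
complete_rows = masked.dropna()     # keep only rows without any NaN
print(list(complete_rows.index))    # ['A', 'E']
```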


In [118]:
df[df['W'] >1].count() # count the rows where W is greater than 1

W    1
X    1
Y    1
Z    1
dtype: int64

In [127]:
df[df['W'] > 0][['Y','X']].min() # minimum of Y and X over the rows where W > 0


Y   -0.933237
X   -0.758872
dtype: float64
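
The chained brackets above filter rows first and then pick columns. The same selection can be written as a single `loc` call with a row mask and a column list. A minimal sketch, using values copied from the `df` output earlier in this notebook (`chained` and `single` are names introduced for illustration):

```python
import pandas as pd

# Values copied from the df output shown earlier in this notebook
df = pd.DataFrame(
    [[ 2.706850,  0.628133,  0.907969,  0.503826],
     [ 0.651118, -0.319318, -0.848077,  0.605965],
     [-2.018168,  0.740122,  0.528813, -0.589001],
     [ 0.188695, -0.758872, -0.933237,  0.955057],
     [ 0.190794,  1.978757,  2.605967,  0.683509]],
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)

# Two steps: filter rows, then select columns
chained = df[df['W'] > 0][['Y', 'X']]

# One step: row condition and column list in the same loc call
single = df.loc[df['W'] > 0, ['Y', 'X']]

print(chained.equals(single))  # True
print(single.min())
```

The single-call form is also the one to prefer when assigning values to the selection, since chained indexing may operate on a temporary copy.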

In [128]:
# Exercise on conditions

In [133]:
df[df['X'] > 0].count()  # use == to test for equality

W    3
X    3
Y    3
Z    3
dtype: int64

In [136]:
df[df['X'] > 1]

          W         X         Y         Z
E  0.190794  1.978757  2.605967  0.683509


In [159]:
df[(df['W']>0) & (df['Y'] > 1) ] # use | for OR and & for AND; wrap each condition in parentheses

          W         X         Y         Z
E  0.190794  1.978757  2.605967  0.683509
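
The other element-wise operators work the same way as `&`. The sketch below shows `|` (OR) and `~` (NOT), and what happens if Python's `and` keyword is used by mistake. It uses values copied from the `df` output earlier in this notebook; `either`, `negated`, and `raised` are names introduced for illustration.

```python
import pandas as pd

# Values copied from the df output shown earlier in this notebook
df = pd.DataFrame(
    [[ 2.706850,  0.628133,  0.907969,  0.503826],
     [ 0.651118, -0.319318, -0.848077,  0.605965],
     [-2.018168,  0.740122,  0.528813, -0.589001],
     [ 0.188695, -0.758872, -0.933237,  0.955057],
     [ 0.190794,  1.978757,  2.605967,  0.683509]],
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)

# OR: rows where at least one condition holds
either = df[(df['W'] > 0) | (df['Y'] > 1)]
print(list(either.index))   # ['A', 'B', 'D', 'E']

# NOT: invert a condition with ~
negated = df[~(df['W'] > 0)]
print(list(negated.index))  # ['C']

# Python's `and` asks for a single True/False value from a whole Series,
# which pandas refuses to guess, so it raises ValueError
raised = False
try:
    df[(df['W'] > 0) and (df['Y'] > 1)]
except ValueError:
    raised = True
print(raised)  # True
```

The parentheses around each condition matter because `&` and `|` bind more tightly than comparison operators such as `>`.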
