# Unit 2 Optimal Data Exploration Through Pandas

## Unit 2.2 Pandas Series and DataFrames

In [2]:
import numpy as np
import pandas as pd

Let's create a pandas Series from a list:

In [3]:
pd.Series(['Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Female'])

0      Male
1    Female
2      Male
3      Male
4    Female
5    Female
6    Female
dtype: object

A Series can also be created from a NumPy ndarray:

In [4]:
ser = pd.Series(np.array(['Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Female']))
ser

0      Male
1    Female
2      Male
3      Male
4    Female
5    Female
6    Female
dtype: object

In [5]:
type(ser)

pandas.core.series.Series

Now let's create a pandas Series from a dictionary:

In [6]:
dict_data = {'a':1, 'b':2, 'c':3}

In [7]:
ser1 = pd.Series(dict_data)
ser1

a    1
b    2
c    3
dtype: int64

In [8]:
type(ser1)

pandas.core.series.Series

In [9]:
ser1.index

Index(['a', 'b', 'c'], dtype='object')

In [10]:
ser1.values

array([1, 2, 3])

Let's look at the index of a Series created from a list:

In [11]:
list_data = ['2019-01-02', 3.14, 'ABC', 100, True]

In [12]:
ser2 = pd.Series(list_data)
ser2

0    2019-01-02
1          3.14
2           ABC
3           100
4          True
dtype: object

In [13]:
ser2.index

RangeIndex(start=0, stop=5, step=1)

In [14]:
ser2.values

array(['2019-01-02', 3.14, 'ABC', 100, True], dtype=object)

In [15]:
type(ser2.values)

numpy.ndarray
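
The `dtype: object` shown for `ser2` comes from mixing strings, numbers and booleans in one list. A minimal sketch of how the dtype follows the data; the variable names here are illustrative and not part of the original notebook:

```python
import pandas as pd

# Homogeneous integers: pandas picks an integer dtype.
ints = pd.Series([1, 2, 3])

# One float among the integers is enough to make the whole Series float64.
floats = pd.Series([1, 2.5, 3])

# Mixed Python types cannot share a numeric dtype, so pandas falls back to object.
mixed = pd.Series(['2019-01-02', 3.14, 'ABC', 100, True])

print(ints.dtype, floats.dtype, mixed.dtype)
```

An `object` Series stores generic Python objects, so numeric methods such as `mean()` are generally not usable on it.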

Now let's create DataFrames:

In [16]:
df = pd.DataFrame({'Name':['Braund, Mr. Owen Harris',
                          'Allen, Mr. William Henry',
                          'Bonnell, Miss. Elizabeth'],
                  'Age':[22, 35, 58],
                  'Sex':['male','male','female']})
df

Unnamed: 0,Name,Age,Sex
0,"Braund, Mr. Owen Harris",22,male
1,"Allen, Mr. William Henry",35,male
2,"Bonnell, Miss. Elizabeth",58,female


We can access a column by its name:

In [17]:
df['Age']

0    22
1    35
2    58
Name: Age, dtype: int64

In [18]:
type(df['Age']) # a column is a pandas Series

pandas.core.series.Series

Indexing into that Series lets us locate a single DataFrame element by column and row:

In [19]:
df['Age'][0]

22
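
The same element can be reached in a single indexing step, which avoids building the intermediate Series. A small sketch with a reduced copy of the table above (the name `people` is illustrative):

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris',
                                'Allen, Mr. William Henry',
                                'Bonnell, Miss. Elizabeth'],
                       'Age': [22, 35, 58]})

age_chained = people['Age'][0]   # two steps: select the column, then the row
age_loc = people.loc[0, 'Age']   # one step, by row label and column label
age_at = people.at[0, 'Age']     # one step, scalar access by labels
age_iat = people.iat[0, 1]       # one step, scalar access by integer positions

print(age_chained, age_loc, age_at, age_iat)
```

For reading a value all four forms agree; for assigning a value, the single-step forms are the safer choice.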

We can also create a DataFrame from a list of lists:

In [20]:
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df

Unnamed: 0,0,1,2
0,1,2,3
1,4,5,6
2,7,8,9


In [21]:
df.columns

RangeIndex(start=0, stop=3, step=1)

In [22]:
df[0]

0    1
1    4
2    7
Name: 0, dtype: int64

In [23]:
type(df[0])

pandas.core.series.Series

In [24]:
eye = pd.Series([220,215,93,64],
                index=['Brown', 'Blue', 'Hazel', 'Green'],
               name='eye_color')
eye

Brown    220
Blue     215
Hazel     93
Green     64
Name: eye_color, dtype: int64

In [25]:
eye.name

'eye_color'

In [26]:
eye.index

Index(['Brown', 'Blue', 'Hazel', 'Green'], dtype='object')

In [27]:
eye.values

array([220, 215,  93,  64])

In [28]:
eye.sort_values()

Green     64
Hazel     93
Blue     215
Brown    220
Name: eye_color, dtype: int64

In [29]:
eye.sort_index()

Blue     215
Brown    220
Green     64
Hazel     93
Name: eye_color, dtype: int64

In [30]:
eye.unique()

array([220, 215,  93,  64])

In [31]:
eye.nunique()

4

Let's see what happens when there are duplicates (repeated values and repeated index labels):

In [32]:
my_data2 = [220, 215, 93, 64, 64]

In [33]:
eye2 = pd.Series(data=my_data2, index=['Brown', 'Blue', 'Blue', 'Hazel', 'Green'])
eye2

Brown    220
Blue     215
Blue      93
Hazel     64
Green     64
dtype: int64

In [34]:
eye2.unique()

array([220, 215,  93,  64])

In [35]:
eye2.nunique()

4

In [36]:
eye2.value_counts()

64     2
220    1
215    1
93     1
Name: count, dtype: int64
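
`unique`, `nunique` and `value_counts` only look at the values. The repeated index label `'Blue'` matters when selecting by label, as this short sketch suggests:

```python
import pandas as pd

eye2 = pd.Series([220, 215, 93, 64, 64],
                 index=['Brown', 'Blue', 'Blue', 'Hazel', 'Green'])

brown = eye2['Brown']   # unique label: a single value
blue = eye2['Blue']     # repeated label: a Series with both matching rows

print(brown)
print(blue)
print(eye2.index.is_unique)      # False, because 'Blue' appears twice
print(eye2.index.duplicated())   # flags the second 'Blue'
```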

Series support indexing and *slices*, just like NumPy arrays:

In [37]:
ser = pd.Series([0, 10, 20, 30, 40], index=['a', 'b', 'c', 'd', 'e'])
ser

a     0
b    10
c    20
d    30
e    40
dtype: int64

In [38]:
ser[1] # 1 is not an index label, so it is taken as the 2nd position (deprecated in recent pandas; prefer ser.iloc[1])

10

In [39]:
ser['b']

10

In [40]:
ser[1:3]

b    10
c    20
dtype: int64

In [41]:
ser['b':'c'] # careful: in a label-based slice the end point IS included

b    10
c    20
dtype: int64
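
To make it explicit which kind of slice is meant, `iloc` (positions, end excluded) and `loc` (labels, end included) can be used. A minimal sketch on the same data:

```python
import pandas as pd

ser = pd.Series([0, 10, 20, 30, 40], index=['a', 'b', 'c', 'd', 'e'])

by_position = ser.iloc[1:3]    # positions 1 and 2; position 3 is excluded
by_label = ser.loc['b':'c']    # labels 'b' through 'c'; 'c' is included

print(by_position)
print(by_label)
```

Both slices select the rows `b` and `c` here, even though the end points are written differently.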

### Series operations

In [42]:
ser1 = pd.Series([0,1,2,3,4], index=[0,1,2,3,4])
ser1

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [43]:
ser2 = pd.Series([0,1,2,3,4], index=[4,3,2,1,0])
ser2

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [44]:
ser1+ser2

0    4
1    4
2    4
3    4
4    4
dtype: int64

In [45]:
ser1*ser2

0    0
1    3
2    4
3    3
4    0
dtype: int64

In [46]:
ser1/ser2

0    0.000000
1    0.333333
2    1.000000
3    3.000000
4         inf
dtype: float64
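
These results follow from arithmetic between Series being aligned on index labels rather than on position (the `inf` above is `4/0` at label 4). A sketch with two hypothetical Series whose labels only partly overlap:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['c', 'b', 'd'])

# Values are matched by label; labels present on only one side give NaN.
total = left + right
print(total)
# a     NaN
# b    22.0
# c    13.0
# d     NaN

# Dropping the index (plain NumPy arrays) gives a purely positional sum.
positional = left.to_numpy() + right.to_numpy()
print(positional)   # [11 22 33]
```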

In [47]:
ser1.sum()

10

In [48]:
ser1.mean()

2.0

In [49]:
ser1.median()

2.0

In [50]:
ser1.max()

4

In [51]:
ser1.min()

0

In [52]:
ser1.std()

1.5811388300841898

In [53]:
ser2

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [54]:
ser2.sort_values()

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [55]:
ser2.sort_index()

0    4
1    3
2    2
3    1
4    0
dtype: int64

Applying functions to Series:

In [56]:
ser_height = pd.Series([160,170,180], name='height')
ser_height

0    160
1    170
2    180
Name: height, dtype: int64

In [57]:
def plus_10(x):
    return x+10

In [58]:
plus_10(5)

15

In [59]:
ser_height.apply(plus_10)

0    170
1    180
2    190
Name: height, dtype: int64

In [60]:
# a lambda function can also be passed as the argument of apply:
ser_height.apply(lambda x : x+10)

0    170
1    180
2    190
Name: height, dtype: int64
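
For simple arithmetic like this, the same result can be obtained without `apply`, by operating on the whole Series at once. A minimal sketch:

```python
import pandas as pd

ser_height = pd.Series([160, 170, 180], name='height')

via_apply = ser_height.apply(lambda x: x + 10)   # calls the function once per element
vectorized = ser_height + 10                     # one operation on the whole Series

print(via_apply.equals(vectorized))   # True
```

`apply` remains useful when the logic cannot be written as a Series operation.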

### Practicing with Series and DataFrames

In [66]:
iris_df = pd.read_csv('data_iris.csv')
iris_df

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [67]:
type(iris_df)

pandas.core.frame.DataFrame

In [68]:
# remember that a DataFrame can be built from a dictionary:
data = {'NAME':['Jake', 'Jeniffer', 'Paul', 'Andrew'],
        'AGE':[24,21,25,19],
        'GENDER':['M', 'F', 'M', 'M']}

In [69]:
df = pd.DataFrame(data)
df

Unnamed: 0,NAME,AGE,GENDER
0,Jake,24,M
1,Jeniffer,21,F
2,Paul,25,M
3,Andrew,19,M


In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   NAME    4 non-null      object
 1   AGE     4 non-null      int64 
 2   GENDER  4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes


In [71]:
iris_df.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [72]:
iris_df.head(3)

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa


In [73]:
iris_df.tail()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


In [74]:
iris_df.tail(3)

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


In [75]:
iris_df.sample(4)

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
80,5.5,2.4,3.8,1.1,versicolor
27,5.2,3.5,1.5,0.2,setosa
107,7.3,2.9,6.3,1.8,virginica
60,5.0,2.0,3.5,1.0,versicolor


In [76]:
iris_df.columns

Index(['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width',
       'Species'],
      dtype='object')

In [77]:
iris_df.index

RangeIndex(start=0, stop=150, step=1)

In [82]:
type(iris_df['Sepal.Length'])

pandas.core.series.Series

In [79]:
# we can change the column names:

iris_df.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Lenght', 'Petal_Width', 'Species']

In [83]:
iris_df.head()

Unnamed: 0,Sepal_Length,Sepal_Width,Petal_Lenght,Petal_Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [84]:
# one way to select a column:
iris_df.Sepal_Length

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: Sepal_Length, Length: 150, dtype: float64

In [85]:
# another way, which also works when the column name contains spaces, etc.
iris_df['Sepal_Length']

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: Sepal_Length, Length: 150, dtype: float64

In [87]:
# we can select more than one column, BUT NOTE that the result is a DataFrame:
iris_df[['Sepal_Length','Sepal_Width']]

Unnamed: 0,Sepal_Length,Sepal_Width
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6
...,...,...
145,6.7,3.0
146,6.3,2.5
147,6.5,3.0
148,6.2,3.4


In [88]:
# in particular, a DataFrame is returned even if the list holds a single column:
iris_df[['Sepal_Length']]

Unnamed: 0,Sepal_Length
0,5.1
1,4.9
2,4.7
3,4.6
4,5.0
...,...
145,6.7
146,6.3
147,6.5
148,6.2


We can also create a DataFrame from a two-dimensional array:

In [89]:
array = [['Name', 'Age', 'Sex', 'School'],
         ['Tom', 15, 'Male', 'middle'],
         ['Alice', 10, 'Female', 'elementary']]

In [90]:
df = pd.DataFrame(array)
df # note the first row: the header list was stored as data

Unnamed: 0,0,1,2,3
0,Name,Age,Sex,School
1,Tom,15,Male,middle
2,Alice,10,Female,elementary


In [95]:
df = pd.DataFrame([['Alice', 10, 'Female', 'elementary'],
                   ['Tom', 15, 'Male', 'middle']],
                  columns=['Name', 'Age', 'Sex', 'School'],index=['stu1', 'stu2'])
df

Unnamed: 0,Name,Age,Sex,School
stu1,Alice,10,Female,elementary
stu2,Tom,15,Male,middle


In [96]:
df.index

Index(['stu1', 'stu2'], dtype='object')

In [97]:
df.index = ['stu1', 'stu2']
df

Unnamed: 0,Name,Age,Sex,School
stu1,Alice,10,Female,elementary
stu2,Tom,15,Male,middle


In [98]:
df.index # slide 202

Index(['stu1', 'stu2'], dtype='object')

In [99]:
# in a similar way we can change the column names:

df.columns = ['student_name', 'years', 'sex2', 'school2']
df

Unnamed: 0,student_name,years,sex2,school2
stu1,Alice,10,Female,elementary
stu2,Tom,15,Male,middle


In [100]:
df.rename(index={'stu1':'student1'}) # returns a new DataFrame:

Unnamed: 0,student_name,years,sex2,school2
student1,Alice,10,Female,elementary
stu2,Tom,15,Male,middle


In [132]:
df # but the original DataFrame has not changed:

Unnamed: 0,student_name,years,sex2,school2
stu1,Alice,10,Female,elementary
stu2,Tom,15,Male,middle


In [101]:
# let's repeat it, this time with inplace=True

df.rename(index={'stu1':'student1'}, inplace=True) # careful: note that nothing is returned

In [102]:
df # but now df has changed

Unnamed: 0,student_name,years,sex2,school2
student1,Alice,10,Female,elementary
stu2,Tom,15,Male,middle


An alternative to `inplace=True` is to reassign the returned value to the variable.

> **Note:** The two are not equivalent when several variables reference the same DataFrame: `inplace=True` changes the object they all share, while reassignment only rebinds one name.

In [103]:
df = df.rename(index={'stu2':'student2'})

In [104]:
df

Unnamed: 0,student_name,years,sex2,school2
student1,Alice,10,Female,elementary
student2,Tom,15,Male,middle
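
The note about several variables referencing the same DataFrame can be checked with a small sketch. The second name, `alias`, is hypothetical and only there to observe the effect:

```python
import pandas as pd

df = pd.DataFrame({'years': [10, 15]}, index=['stu1', 'stu2'])
alias = df   # a second name for the SAME object, not a copy

# inplace=True modifies the shared object, so alias sees the new label.
df.rename(index={'stu1': 'student1'}, inplace=True)
print(list(alias.index))   # ['student1', 'stu2']

# Reassignment binds df to a NEW object; alias keeps pointing to the old one.
df = df.rename(index={'stu2': 'student2'})
print(list(df.index))      # ['student1', 'student2']
print(list(alias.index))   # ['student1', 'stu2']
```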


In [105]:
# `rename` can also replace column names:
df.rename(columns={'student_name':'stu_name'}, inplace=True)

In [106]:
df

Unnamed: 0,stu_name,years,sex2,school2
student1,Alice,10,Female,elementary
student2,Tom,15,Male,middle


In [108]:
# we can rename several labels, in rows and columns, at the same time.
# what happens if an entry does not exist? let's see:

df.rename(index={'student1':'estudiante1', 'student2':'estudiante2', 'student3':'estudiante3'},
          columns={'sex2':'sex', 'school2':'school'})

# since we did not pass inplace=True, a new DataFrame is returned and df is left unchanged;
# the nonexistent label 'student3' is silently ignored

Unnamed: 0,stu_name,years,sex,school
estudiante1,Alice,10,Female,elementary
estudiante2,Tom,15,Male,middle
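
If a missing label should be reported instead of ignored, `rename` accepts an `errors` argument. A minimal sketch, assuming a pandas version that supports `errors='raise'` in `rename`; the small DataFrame is made up for the example:

```python
import pandas as pd

df = pd.DataFrame({'sex2': ['Female', 'Male']},
                  index=['student1', 'student2'])

# Default behaviour (errors='ignore'): the unknown label is skipped silently.
unchanged = df.rename(index={'student3': 'estudiante3'})

# With errors='raise' the unknown label produces a KeyError.
raised = False
try:
    df.rename(index={'student3': 'estudiante3'}, errors='raise')
except KeyError as exc:
    raised = True
    print('rename failed:', exc)
```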


In [113]:
def f(x):
    return x.upper()

In [114]:
df.rename(columns=f) # rename also accepts a function, applied to every column name

Unnamed: 0,STU_NAME,YEARS,SEX2,SCHOOL2
student1,Alice,10,Female,elementary
student2,Tom,15,Male,middle


### Deleting rows and columns

We will use the `drop` method:

In [128]:
exam_data = {'math'   : [ 90,  80,  70],
             'eng'    : [ 98,  89,  95],
             'music'  : [ 85,  95, 100],
             'science': [100,  90,  90]}

In [129]:
type(exam_data)

dict

In [130]:
df = pd.DataFrame(exam_data, index=['stu1', 'stu2', 'stu3'])
df

Unnamed: 0,math,eng,music,science
stu1,90,98,85,100
stu2,80,89,95,90
stu3,70,95,100,90


In [131]:
# the slides copy with a slice, df2=df[:]
# we will see later that this is not exactly the same
df2 = df.copy()

In [132]:
df2

Unnamed: 0,math,eng,music,science
stu1,90,98,85,100
stu2,80,89,95,90
stu3,70,95,100,90


In [133]:
df2.drop('stu1', inplace=True) # again, with inplace=True nothing is returned

In [135]:
df2.drop(['math','eng'], axis=1) # without inplace=True the result is discarded and df2 keeps its columns
df2

Unnamed: 0,math,eng,music,science
stu2,80,89,95,90
stu3,70,95,100,90


In [136]:
df2

Unnamed: 0,math,eng,music,science
stu2,80,89,95,90
stu3,70,95,100,90


In [137]:
df # the original

Unnamed: 0,math,eng,music,science
stu1,90,98,85,100
stu2,80,89,95,90
stu3,70,95,100,90


In [138]:
# of course, several rows can be deleted at once:
df3 = df.drop(['stu2','stu3'], axis=0) # note that we did NOT pass inplace=True
df3

Unnamed: 0,math,eng,music,science
stu1,90,98,85,100


In [139]:
df

Unnamed: 0,math,eng,music,science
stu1,90,98,85,100
stu2,80,89,95,90
stu3,70,95,100,90


Note the difference between copying with `[:]` and with `.copy()`.

The former triggers a warning when the copy is modified in place:

```
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(
```

In [141]:
df4 = df[:]
df4.drop('math', axis=1, inplace=True)
df4

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df4.drop('math', axis=1, inplace=True)


Unnamed: 0,eng,music,science
stu1,98,85,100
stu2,89,95,90
stu3,95,100,90


Now we repeat it using `.copy()`:

In [146]:
df4 = df.copy()
df4.drop(['math','eng'], axis=1, inplace=True)
df4

Unnamed: 0,music,science
stu1,85,100
stu2,95,90
stu3,100,90


In [143]:
df # the original still has the columns that were dropped from df4

Unnamed: 0,math,eng,music,science
stu1,90,98,85,100
stu2,80,89,95,90
stu3,70,95,100,90


In [144]:
# several columns can be deleted at once:
df5 = df.drop(['eng','science'], axis=1, inplace=False) # or simply omit inplace
df5

Unnamed: 0,math,music
stu1,90,85
stu2,80,95
stu3,70,100


### Selecting elements

In [145]:
df

Unnamed: 0,math,eng,music,science
stu1,90,98,85,100
stu2,80,89,95,90
stu3,70,95,100,90


In [152]:
# loc returns a Series here: the row with the selected index label
# iloc locates rows by integer position
df.loc['stu1']

math        90
eng         98
music       85
science    100
Name: stu1, dtype: int64

In [150]:
type(df.loc['stu1'])

pandas.core.series.Series

In [156]:
df.index


Index(['stu1', 'stu2', 'stu3'], dtype='object')

In [153]:
df.iloc[0]

math        90
eng         98
music       85
science    100
Name: stu1, dtype: int64

In [159]:
# a second list selects the columns we want to keep
df.loc[['stu1','stu3'],['math','music']] # the rows do not need to be contiguous

Unnamed: 0,math,music
stu1,90,85
stu3,70,100


In [158]:
df.iloc[[0,2]]

Unnamed: 0,math,eng,music,science
stu1,90,98,85,100
stu3,70,95,100,90


In [165]:
# select rows with a boolean condition
df.loc[df['math']>75]

Unnamed: 0,math,eng,music,science
stu1,90,98,85,100
stu2,80,89,95,90
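
Several conditions can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than the comparisons. A sketch on the same exam data; the result names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'math':    [ 90,  80,  70],
                   'eng':     [ 98,  89,  95],
                   'music':   [ 85,  95, 100],
                   'science': [100,  90,  90]},
                  index=['stu1', 'stu2', 'stu3'])

# Rows where BOTH conditions hold.
both = df.loc[(df['math'] > 75) & (df['music'] >= 95)]

# Rows where AT LEAST ONE condition holds.
either = df.loc[(df['math'] > 75) | (df['music'] >= 95)]

# Rows where the condition does NOT hold.
not_high_math = df.loc[~(df['math'] > 75)]

print(both.index.tolist(), either.index.tolist(), not_high_math.index.tolist())
```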


Selecting columns

In [160]:
math1 = df['math']
math1

stu1    90
stu2    80
stu3    70
Name: math, dtype: int64

In [161]:
english = df.eng
english

stu1    98
stu2    89
stu3    95
Name: eng, dtype: int64

In [166]:
music_sci = df[['music','science']]
music_sci

Unnamed: 0,music,science
stu1,85,100
stu2,95,90
stu3,100,90


In [167]:
df.math

stu1    90
stu2    80
stu3    70
Name: math, dtype: int64

In [168]:
df['math']

stu1    90
stu2    80
stu3    70
Name: math, dtype: int64

In [169]:
df[['math']]

Unnamed: 0,math
stu1,90
stu2,80
stu3,70


In [173]:
exam_data = {'name':['honglidong', 'hongeedong', 'hongsamdong'],
            'math':[90, 80, 70],
            'eng':[98, 89, 95],
            'music':[85, 95, 100],
            'phys_tra':[100, 90, 90]}

In [175]:
df = pd.DataFrame(exam_data)
df

Unnamed: 0,name,math,eng,music,phys_tra
0,honglidong,90,98,85,100
1,hongeedong,80,89,95,90
2,hongsamdong,70,95,100,90


In [176]:
df.loc[df.math>df.music]

Unnamed: 0,name,math,eng,music,phys_tra
0,honglidong,90,98,85,100


In [177]:
df.index

RangeIndex(start=0, stop=3, step=1)

### Setting an index

We can use one of the columns as the index:

In [178]:
df.set_index('name', inplace=True)

In [181]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, honglidong to hongsamdong
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   math      3 non-null      int64
 1   eng       3 non-null      int64
 2   music     3 non-null      int64
 3   phys_tra  3 non-null      int64
dtypes: int64(4)
memory usage: 228.0+ bytes


In [179]:
df

Unnamed: 0_level_0,math,eng,music,phys_tra
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
honglidong,90,98,85,100
hongeedong,80,89,95,90
hongsamdong,70,95,100,90


In [180]:
df.loc['honglidong']

math         90
eng          98
music        85
phys_tra    100
Name: honglidong, dtype: int64

In [185]:
df.loc['honglidong','music']

85

In [186]:
df.loc['honglidong',['music','phys_tra']]

music        85
phys_tra    100
Name: honglidong, dtype: int64

In [187]:
df.loc[['honglidong'],['music','phys_tra']]

Unnamed: 0_level_0,music,phys_tra
name,Unnamed: 1_level_1,Unnamed: 2_level_1
honglidong,85,100


In [188]:
df.loc[['honglidong','hongeedong'],['music','phys_tra']]

Unnamed: 0_level_0,music,phys_tra
name,Unnamed: 1_level_1,Unnamed: 2_level_1
honglidong,85,100
hongeedong,95,90


In [189]:
df.iloc[0,[2,3]]

music        85
phys_tra    100
Name: honglidong, dtype: int64

In [192]:
# slices can be used too
df.iloc[[2,0],2:3]

Unnamed: 0_level_0,music
name,Unnamed: 1_level_1
hongsamdong,100
honglidong,85


We can use *slices*; note the difference between `loc` and `iloc`:

In [193]:
df.loc['honglidong', 'eng':'phys_tra'] # with loc the end point of the slice IS included:

eng          98
music        85
phys_tra    100
Name: honglidong, dtype: int64

In [202]:
df.iloc[0,1:3] # with iloc the end point of the slice is NOT included:

eng      98
music    85
Name: honglidong, dtype: int64

In [195]:
# the column at position 3 is left out of such a slice; indexing position 3 on its own does return it:
df.iloc[0,3]

100

In [196]:
# another slice example:
df.loc[['honglidong','hongeedong'],'music':'phys_tra']

Unnamed: 0_level_0,music,phys_tra
name,Unnamed: 1_level_1,Unnamed: 2_level_1
honglidong,85,100
hongeedong,95,90


### Adding columns

In [197]:
df['kor'] = 80
df # note that every value in the new column is 80

Unnamed: 0_level_0,math,eng,music,phys_tra,kor
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
honglidong,90,98,85,100,80
hongeedong,80,89,95,90,80
hongsamdong,70,95,100,90,80


In [211]:
df

Unnamed: 0_level_0,math,eng,music,phys_tra,kor,media,product
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
honglidong,90,98,85,100,80,90.6,2.952646e+23
hongeedong,80,89,95,90,80,86.8,1.786947e+23
hongsamdong,70,95,100,90,80,87.0,1.735189e+23


In [214]:
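# exercise: create a column holding the product of all the other columns (here, `product`)
# drop returns a new DataFrame without that column; df itself is not modified:
df.drop('product',axis=1)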

Unnamed: 0_level_0,math,eng,music,phys_tra,kor,media
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
honglidong,90,98,85,100,80,90.6
hongeedong,80,89,95,90,80,86.8
hongsamdong,70,95,100,90,80,87.0
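
The cells that built `media` and `product` are not shown here. One possible way to derive such columns is with row-wise aggregates, sketched below under the assumption that only the five subject columns should take part; the names `scores` and `subjects` are illustrative:

```python
import pandas as pd

scores = pd.DataFrame({'math':     [ 90, 80,  70],
                       'eng':      [ 98, 89,  95],
                       'music':    [ 85, 95, 100],
                       'phys_tra': [100, 90,  90],
                       'kor':      [ 80, 80,  80]},
                      index=['honglidong', 'hongeedong', 'hongsamdong'])

# Listing the subject columns explicitly keeps derived columns out of later aggregates.
subjects = ['math', 'eng', 'music', 'phys_tra', 'kor']

scores['media'] = scores[subjects].mean(axis=1)     # row-wise mean
scores['product'] = scores[subjects].prod(axis=1)   # row-wise product

print(scores)
```

With this restriction the `media` values match the table above (90.6, 86.8, 87.0). The `product` values will differ from the ones displayed if those were computed over a different set of columns.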


### Adding rows

In [215]:
df.index

Index(['honglidong', 'hongeedong', 'hongsamdong'], dtype='object', name='name')

In [216]:
df.loc['Juan'] = 0
df

Unnamed: 0_level_0,math,eng,music,phys_tra,kor,media,product
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
honglidong,90,98,85,100,80,90.6,2.952646e+23
hongeedong,80,89,95,90,80,86.8,1.786947e+23
hongsamdong,70,95,100,90,80,87.0,1.735189e+23
Juan,0,0,0,0,0,0.0,0.0


In [217]:
df.loc[2] = 0 # note that with .loc the 2 is a label, so a new row labelled 2 is added
df

Unnamed: 0_level_0,math,eng,music,phys_tra,kor,media,product
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
honglidong,90,98,85,100,80,90.6,2.952646e+23
hongeedong,80,89,95,90,80,86.8,1.786947e+23
hongsamdong,70,95,100,90,80,87.0,1.735189e+23
Juan,0,0,0,0,0,0.0,0.0
2,0,0,0,0,0,0.0,0.0


In [221]:
df.loc[2] = [90, 80, 70, 60, 77,0,0] # a row with that label already exists, so it is overwritten
df

Unnamed: 0_level_0,math,eng,music,phys_tra,kor,media,product
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
honglidong,90,98,85,100,80,90.6,2.952646e+23
hongeedong,80,89,95,90,80,86.8,1.786947e+23
hongsamdong,70,95,100,90,80,87.0,1.735189e+23
Juan,0,0,0,0,0,0.0,0.0
2,90,80,70,60,77,0.0,0.0


In [222]:
df.index

Index(['honglidong', 'hongeedong', 'hongsamdong', 'Juan', 2], dtype='object', name='name')

In [223]:
df.index[3]

'Juan'
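
Assigning to a new label with `loc` adds one row at a time. When the new rows already live in another DataFrame, `pd.concat` is an alternative. A sketch with a reduced, hypothetical table:

```python
import pandas as pd

df = pd.DataFrame({'math': [90, 80], 'eng': [98, 89]},
                  index=['honglidong', 'hongeedong'])

# Enlargement with loc: the label does not exist yet, so a row is appended.
df.loc['Juan'] = [75, 60]

# Rows held in a second DataFrame can be stacked below with concat.
new_rows = pd.DataFrame({'math': [70], 'eng': [95]}, index=['hongsamdong'])
combined = pd.concat([df, new_rows])

print(combined)
```

`concat` returns a new DataFrame and leaves `df` and `new_rows` untouched.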

### Resetting the index

In [224]:
df

Unnamed: 0_level_0,math,eng,music,phys_tra,kor,media,product
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
honglidong,90,98,85,100,80,90.6,2.952646e+23
hongeedong,80,89,95,90,80,86.8,1.786947e+23
hongsamdong,70,95,100,90,80,87.0,1.735189e+23
Juan,0,0,0,0,0,0.0,0.0
2,90,80,70,60,77,0.0,0.0


In [228]:
df.reset_index()

Unnamed: 0,index,name,math,eng,music,phys_tra,kor,media,product
0,0,honglidong,90,98,85,100,80,90.6,2.952646e+23
1,1,hongeedong,80,89,95,90,80,86.8,1.786947e+23
2,2,hongsamdong,70,95,100,90,80,87.0,1.735189e+23
3,3,Juan,0,0,0,0,0,0.0,0.0
4,4,2,90,80,70,60,77,0.0,0.0


In [229]:
df

Unnamed: 0,name,math,eng,music,phys_tra,kor,media,product
0,honglidong,90,98,85,100,80,90.6,2.952646e+23
1,hongeedong,80,89,95,90,80,86.8,1.786947e+23
2,hongsamdong,70,95,100,90,80,87.0,1.735189e+23
3,Juan,0,0,0,0,0,0.0,0.0
4,2,90,80,70,60,77,0.0,0.0


In [230]:
df.reset_index(inplace=True) # inplace=True is supported

In [231]:
df

Unnamed: 0,index,name,math,eng,music,phys_tra,kor,media,product
0,0,honglidong,90,98,85,100,80,90.6,2.952646e+23
1,1,hongeedong,80,89,95,90,80,86.8,1.786947e+23
2,2,hongsamdong,70,95,100,90,80,87.0,1.735189e+23
3,3,Juan,0,0,0,0,0,0.0,0.0
4,4,2,90,80,70,60,77,0.0,0.0


**CAREFUL!** If you call `reset_index` again, it adds yet another column holding the old index...

In [232]:
df_prueba = df.copy()
df_prueba

Unnamed: 0,index,name,math,eng,music,phys_tra,kor,media,product
0,0,honglidong,90,98,85,100,80,90.6,2.952646e+23
1,1,hongeedong,80,89,95,90,80,86.8,1.786947e+23
2,2,hongsamdong,70,95,100,90,80,87.0,1.735189e+23
3,3,Juan,0,0,0,0,0,0.0,0.0
4,4,2,90,80,70,60,77,0.0,0.0


In [233]:
df_prueba.reset_index()

Unnamed: 0,level_0,index,name,math,eng,music,phys_tra,kor,media,product
0,0,0,honglidong,90,98,85,100,80,90.6,2.952646e+23
1,1,1,hongeedong,80,89,95,90,80,86.8,1.786947e+23
2,2,2,hongsamdong,70,95,100,90,80,87.0,1.735189e+23
3,3,3,Juan,0,0,0,0,0,0.0,0.0
4,4,4,2,90,80,70,60,77,0.0,0.0
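
To avoid piling up such columns, `reset_index` can be told to discard the old index with `drop=True`. A minimal sketch on a small hypothetical table:

```python
import pandas as pd

df = pd.DataFrame({'math': [90, 80, 70]},
                  index=pd.Index(['stu1', 'stu2', 'stu3'], name='name'))

once = df.reset_index()               # the named index becomes the column 'name'
twice = once.reset_index()            # the old 0..2 index becomes a column 'index'
clean = once.reset_index(drop=True)   # the old 0..2 index is simply discarded

print(once.columns.tolist())    # ['name', 'math']
print(twice.columns.tolist())   # ['index', 'name', 'math']
print(clean.columns.tolist())   # ['name', 'math']
```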


In [234]:
df.iloc[4,0] = 'hongsadong' # this modifies df, not df_prueba
df

Unnamed: 0,index,name,math,eng,music,phys_tra,kor,media,product
0,0,honglidong,90,98,85,100,80,90.6,2.952646e+23
1,1,hongeedong,80,89,95,90,80,86.8,1.786947e+23
2,2,hongsamdong,70,95,100,90,80,87.0,1.735189e+23
3,3,Juan,0,0,0,0,0,0.0,0.0
4,hongsadong,2,90,80,70,60,77,0.0,0.0


In [235]:
# set the name column as the index again:
df.set_index('name', inplace=True)
df

Unnamed: 0_level_0,index,math,eng,music,phys_tra,kor,media,product
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
honglidong,0,90,98,85,100,80,90.6,2.952646e+23
hongeedong,1,80,89,95,90,80,86.8,1.786947e+23
hongsamdong,2,70,95,100,90,80,87.0,1.735189e+23
Juan,3,0,0,0,0,0,0.0,0.0
2,hongsadong,90,80,70,60,77,0.0,0.0


In [236]:
# slide 244

exam_data = {'name':['stu1', 'stu2', 'stu3'],
            'math':[90, 80, 70],
            'eng':[98, 89, 95],
            'mus':[85, 95, 100],
            'phy':[100, 90, 90]}
df = pd.DataFrame(exam_data)
df

Unnamed: 0,name,math,eng,mus,phy
0,stu1,90,98,85,100
1,stu2,80,89,95,90
2,stu3,70,95,100,90


In [237]:
df.index

RangeIndex(start=0, stop=3, step=1)

In [238]:
df.set_index('name', inplace=True)
df

Unnamed: 0_level_0,math,eng,mus,phy
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
stu1,90,98,85,100
stu2,80,89,95,90
stu3,70,95,100,90


In [239]:
df.index

Index(['stu1', 'stu2', 'stu3'], dtype='object', name='name')

In [240]:
df.iloc[0,3] = 80
df

Unnamed: 0_level_0,math,eng,mus,phy
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
stu1,90,98,85,80
stu2,80,89,95,90
stu3,70,95,100,90


In [246]:
df.loc['stu1']['phy'] = 90
df

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df.loc['stu1']['phy'] = 90


Unnamed: 0_level_0,math,eng,mus,phy
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
stu1,90,98,85,90
stu2,80,89,95,90
stu3,70,95,100,90


In [245]:
df.loc['stu1']['phy'] = 100
df

Unnamed: 0_level_0,math,eng,mus,phy
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
stu1,90,98,85,100
stu2,80,89,95,90
stu3,70,95,100,90


In [243]:
df.loc['stu1']['mus','phy'] = 50
df

IndexingError: Too many indexers

In [244]:
df.loc['stu1']['mus','phy'] = 70,80
df

IndexingError: Too many indexers
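
The warning text recommends doing the assignment in a single `.loc` step. A sketch of what that could look like for the cells above, including the two-column case that failed with `IndexingError`:

```python
import pandas as pd

df = pd.DataFrame({'math': [90, 80, 70],
                   'eng':  [98, 89, 95],
                   'mus':  [85, 95, 100],
                   'phy':  [100, 90, 90]},
                  index=['stu1', 'stu2', 'stu3'])

# One cell: row label and column label in the same .loc call.
df.loc['stu1', 'phy'] = 90

# Several cells of one row: pass a LIST of column labels.
df.loc['stu1', ['mus', 'phy']] = 50        # the scalar is broadcast to both cells
df.loc['stu1', ['mus', 'phy']] = [70, 80]  # one value per selected column

print(df)
```

Because the selection and the assignment happen in one call, there is no intermediate object that could turn out to be a copy.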

In [294]:
df.reset_index(inplace=True)

In [247]:
df

Unnamed: 0_level_0,math,eng,mus,phy
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
stu1,90,98,85,90
stu2,80,89,95,90
stu3,70,95,100,90


### Transposing the DataFrame

In [248]:
dff = df.transpose()
dff

name,stu1,stu2,stu3
math,90,80,70
eng,98,89,95
mus,85,95,100
phy,90,90,90


In [249]:
# the .T property can be used as well

dff.T

Unnamed: 0_level_0,math,eng,mus,phy
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
stu1,90,98,85,90
stu2,80,89,95,90
stu3,70,95,100,90


In [250]:
# slide 254

exam_data = {'name':['stu1', 'stu2', 'stu3'],
            'math':[90, 80, 70],
            'eng':[98, 89, 95],
            'mus':[85, 95, 100],
            'phy':[100, 90, 90]}
df = pd.DataFrame(exam_data)
df

Unnamed: 0,name,math,eng,mus,phy
0,stu1,90,98,85,100
1,stu2,80,89,95,90
2,stu3,70,95,100,90


In [251]:
ndf = df.set_index(['name'])
ndf

Unnamed: 0_level_0,math,eng,mus,phy
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
stu1,90,98,85,100
stu2,80,89,95,90
stu3,70,95,100,90


In [252]:
ndf2 = ndf.set_index('mus')
ndf2

Unnamed: 0_level_0,math,eng,phy
mus,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
85,90,98,100
95,80,89,90
100,70,95,90


In [253]:
ndf3 = ndf.set_index(['math','mus'])
ndf3

Unnamed: 0_level_0,Unnamed: 1_level_0,eng,phy
math,mus,Unnamed: 2_level_1,Unnamed: 3_level_1
90,85,98,100
80,95,89,90
70,100,95,90


In [305]:
ndf3.index

MultiIndex([(90,  85),
            (80,  95),
            (70, 100)],
           names=['math', 'mus'])
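
Rows of a DataFrame with a `MultiIndex` are addressed with one value per level. A minimal sketch, rebuilding `ndf3` from the same exam data (the name `exam` is illustrative):

```python
import pandas as pd

exam = pd.DataFrame({'math': [90, 80, 70],
                     'eng':  [98, 89, 95],
                     'mus':  [85, 95, 100],
                     'phy':  [100, 90, 90]})
ndf3 = exam.set_index(['math', 'mus'])

# A full key is a tuple with one value per index level; it returns that row as a Series.
row = ndf3.loc[(90, 85)]
print(row)

# xs selects by the value of a single, named level.
by_math = ndf3.xs(90, level='math')
print(by_math)

# reset_index turns both levels back into ordinary columns.
flat = ndf3.reset_index()
print(flat.columns.tolist())   # ['math', 'mus', 'eng', 'phy']
```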

In [254]:
dic_data = {'c0':[ 1, 2, 3],
            'c1':[ 4, 5, 6],
            'c2':[ 7, 8, 9],
            'c3':[10,11,12],
            'c4':[13,14,15],
           }
df = pd.DataFrame(dic_data, index=['r0','r1','r2'])
df

Unnamed: 0,c0,c1,c2,c3,c4
r0,1,4,7,10,13
r1,2,5,8,11,14
r2,3,6,9,12,15


In [255]:
new_index = ['r0', 'r1', 'r2', 'r3', 'r4']
ndf = df.reindex(new_index)
ndf

Unnamed: 0,c0,c1,c2,c3,c4
r0,1.0,4.0,7.0,10.0,13.0
r1,2.0,5.0,8.0,11.0,14.0
r2,3.0,6.0,9.0,12.0,15.0
r3,,,,,
r4,,,,,


In [256]:
ndf2 = df.reindex(new_index, fill_value=0)
ndf2

Unnamed: 0,c0,c1,c2,c3,c4
r0,1,4,7,10,13
r1,2,5,8,11,14
r2,3,6,9,12,15
r3,0,0,0,0,0
r4,0,0,0,0,0
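The two outputs above differ in more than the filled values: without `fill_value` the integers are shown as floats. A short sketch of why, using a smaller hypothetical frame: NaN is a floating-point value, so a column that receives it is upcast to `float64`, while filling with `0` lets the column stay `int64`.

```python
import pandas as pd

df = pd.DataFrame({"c0": [1, 2, 3], "c1": [4, 5, 6]}, index=["r0", "r1", "r2"])
new_index = ["r0", "r1", "r2", "r3"]

with_nan = df.reindex(new_index)                 # the new row is filled with NaN
with_zero = df.reindex(new_index, fill_value=0)  # the new row is filled with 0

print(with_nan.dtypes.tolist())   # NaN forces a float dtype
print(with_zero.dtypes.tolist())  # the integer dtype is preserved
print(int(with_nan.isna().sum().sum()))
```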


In [257]:
df

Unnamed: 0,c0,c1,c2,c3,c4
r0,1,4,7,10,13
r1,2,5,8,11,14
r2,3,6,9,12,15


In [258]:
ndf = df.reset_index()
ndf

Unnamed: 0,index,c0,c1,c2,c3,c4
0,r0,1,4,7,10,13
1,r1,2,5,8,11,14
2,r2,3,6,9,12,15
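`reset_index()` returns a new DataFrame and keeps the old labels in a column called `index`. If the old labels are not needed, `drop=True` discards them instead. A minimal sketch on a one-column hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"c0": [1, 2, 3]}, index=["r0", "r1", "r2"])

kept = df.reset_index()              # old labels become a column named "index"
dropped = df.reset_index(drop=True)  # old labels are discarded

print(list(kept.columns))
print(list(dropped.columns))
print(list(df.index))                # the original DataFrame is unchanged
```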


In [259]:
df

Unnamed: 0,c0,c1,c2,c3,c4
r0,1,4,7,10,13
r1,2,5,8,11,14
r2,3,6,9,12,15


In [264]:
ndf = df.sort_index(ascending=False)
ndf

Unnamed: 0,c0,c1,c2,c3,c4
r2,3,6,9,12,15
r1,2,5,8,11,14
r0,1,4,7,10,13


In [268]:
# slide 265
# Sorting: order the rows by the values of a column
ndf = df.sort_values(by='c1', ascending=False)
ndf

Unnamed: 0,c0,c1,c2,c3,c4
r2,3,6,9,12,15
r1,2,5,8,11,14
r0,1,4,7,10,13
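`sort_values` also accepts a list of columns, which helps when the first column contains ties. The sketch below uses invented data with a tie in `c0` and a separate sort direction per column.

```python
import pandas as pd

# Hypothetical data with a tie in column "c0"
df = pd.DataFrame({"c0": [2, 1, 2], "c1": [5, 9, 7]}, index=["r0", "r1", "r2"])

# Sort by c0 ascending, then break ties by c1 descending
ordered = df.sort_values(by=["c0", "c1"], ascending=[True, False])
print(list(ordered.index))
```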


### Series operators

In [271]:
# example of Series operations
ndf['c5']=ndf.c0+ndf.c1
# another way to do it
ndf['c5']=ndf.c0.add(ndf.c1, fill_value=0)
ndf

Unnamed: 0,c0,c1,c2,c3,c4,c5
r2,3,6,9,12,15,9
r1,2,5,8,11,14,7
r0,1,4,7,10,13,5


In [272]:
student1 = pd.Series({'kor':100, 'eng':80, 'math':90})
student1

kor     100
eng      80
math     90
dtype: int64

In [273]:
percentage = student1 / 100
percentage

kor     1.0
eng     0.8
math    0.9
dtype: float64

In [274]:
student1 = pd.Series({'kor' :100, 'eng':80, 'math':90})
student2 = pd.Series({'math': 80, 'kor':90, 'eng' :80}) # the order differs
student2

math    80
kor     90
eng     80
dtype: int64

In [275]:
student1.index

Index(['kor', 'eng', 'math'], dtype='object')

In [276]:
student2.index

Index(['math', 'kor', 'eng'], dtype='object')

In [277]:
student1 + student2

eng     160
kor     190
math    170
dtype: int64

In [278]:
student2 + student1

eng     160
kor     190
math    170
dtype: int64

In [279]:
student1 - student2

eng      0
kor     10
math    10
dtype: int64

In [280]:
student1 * student2

eng     6400
kor     9000
math    7200
dtype: int64

In [281]:
result = pd.DataFrame([student1 + student2,
                       student1 - student2,
                       student1 * student2,
                       student1 / student2],
                     index=['addition', 'subtraction','multiplication','division'])
result # the column order differs from the Samsung slides

Unnamed: 0,eng,kor,math
addition,160.0,190.0,170.0
subtraction,0.0,10.0,10.0
multiplication,6400.0,9000.0,7200.0
division,1.0,1.111111,1.125
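The results above do not depend on the order of the entries because pandas pairs values by index label before operating. As a rough illustration, the sketch below contrasts that with a purely positional addition on the underlying NumPy arrays, which ignores the labels.

```python
import pandas as pd

student1 = pd.Series({"kor": 100, "eng": 80, "math": 90})
student2 = pd.Series({"math": 80, "kor": 90, "eng": 80})

# Label-based: kor is added to kor, eng to eng, math to math
by_label = student1 + student2

# Position-based: first element plus first element, ignoring the labels
by_position = student1.to_numpy() + student2.to_numpy()

print(by_label.to_dict())
print(by_position.tolist())
```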


When a label is missing from one of the Series, its value is treated as NaN (pandas's way of marking missing values, even though NaN was not really designed for that purpose).

In [282]:
student1 = pd.Series({'kor' :np.nan, 'eng':80, 'math':90})
student2 = pd.Series({'math': 80, 'kor':90})

In [283]:
pd.DataFrame([student1 + student2,
              student1 - student2,
              student1 * student2,
              student1 / student2],
             index=['addition', 'subtraction','multiplication','division'])

Unnamed: 0,eng,kor,math
addition,,,170.0
subtraction,,,10.0
multiplication,,,7200.0
division,,,1.125


We can use `fill_value` as follows. Note that addition, subtraction, etc. are performed with methods instead of the operators `+`, `-`, etc.

In [284]:
student1 = pd.Series({'kor' :np.nan, 'eng':80, 'math':90})
student2 = pd.Series({'math': 80, 'kor':90})

sr_add = student1.add(student2, fill_value=0)
sr_sub = student1.sub(student2, fill_value=0)
sr_mul = student1.mul(student2, fill_value=0)
sr_div = student1.div(student2, fill_value=0)

pd.DataFrame([sr_add, sr_sub, sr_mul, sr_div],
             index=['addition', 'subtraction','multiplication','division'])

Unnamed: 0,eng,kor,math
addition,80.0,90.0,170.0
subtraction,80.0,-90.0,10.0
multiplication,0.0,0.0,7200.0
division,inf,0.0,1.125
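Two details of the table above are worth a closer look. The sketch below, on invented Series, shows that `fill_value` replaces a missing value only when the other side has one (if both are missing, the result stays NaN), and that filling a missing divisor with `0` produces `inf` rather than an error.

```python
import numpy as np
import pandas as pd

a = pd.Series({"x": 1.0, "y": np.nan, "z": np.nan})
b = pd.Series({"x": 10.0, "y": 5.0, "z": np.nan})

# y is missing on one side only, so it is filled; z is missing on both sides
total = a.add(b, fill_value=0)
print(total)

# "eng" has no divisor, so it is divided by the fill value 0
ratio = pd.Series({"eng": 80.0}).div(pd.Series({"kor": 90.0}), fill_value=0)
print(ratio)
```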


**Note:** Although the Seaborn library is covered later, we use it here to load data:

In [286]:
import seaborn as sns

In [288]:
titanic = sns.load_dataset('titanic')
titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [290]:
titanic.info()

# category dtype: a type used for labels to gain efficiency, e.g. reducing a string to a single letter

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [292]:
titanic.deck.unique()

[NaN, 'C', 'E', 'G', 'D', 'A', 'B', 'F']
Categories (7, object): ['A', 'B', 'C', 'D', 'E', 'F', 'G']
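To make the `category` dtype more concrete, the sketch below converts a hypothetical column of repeated town names. Each distinct label is stored once, and every row holds only a small integer code, which is where the memory saving comes from.

```python
import pandas as pd

# Hypothetical column with a few labels repeated many times
towns = pd.Series(["Southampton", "Cherbourg", "Southampton", "Queenstown"] * 1000)
as_category = towns.astype("category")

print(as_category.cat.categories.tolist())     # the distinct labels
print(as_category.cat.codes.head(4).tolist())  # one integer code per row

# Compare the memory footprint of both representations
print(towns.memory_usage(deep=True), as_category.memory_usage(deep=True))
```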

In [293]:
df = titanic.loc[:,['age','fare']]
df.head()

Unnamed: 0,age,fare
0,22.0,7.25
1,38.0,71.2833
2,26.0,7.925
3,35.0,53.1
4,35.0,8.05


In [294]:
df.tail()

Unnamed: 0,age,fare
886,27.0,13.0
887,19.0,30.0
888,,23.45
889,26.0,30.0
890,32.0,7.75


In [295]:
addition = df + 10
addition

Unnamed: 0,age,fare
0,32.0,17.2500
1,48.0,81.2833
2,36.0,17.9250
3,45.0,63.1000
4,45.0,18.0500
...,...,...
886,37.0,23.0000
887,29.0,40.0000
888,,33.4500
889,36.0,40.0000


In [296]:
# operations between DataFrames
subtraction = addition - df
subtraction.tail()

Unnamed: 0,age,fare
886,10.0,10.0
887,10.0,10.0
888,,10.0
889,10.0,10.0
890,10.0,10.0
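The NaN in row 888 above comes from the missing age: any arithmetic involving NaN yields NaN. A reduced sketch with a two-row hypothetical frame shows both the scalar broadcast and this propagation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22.0, np.nan], "fare": [7.25, 23.45]})

shifted = df + 10    # the scalar is applied to every cell
diff = shifted - df  # element-wise, aligned on index and column labels

print(diff)
print(int(diff.isna().sum().sum()))  # the missing age stays missing
```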


In [304]:
# much better: vectorized column arithmetic instead of a row-wise apply
ndf['c6']=(ndf.c4+ndf.c5)/ndf.c2
ndf

Unnamed: 0,c0,c1,c2,c3,c4,c5,c7,c6
r2,3,6,9,12,15,9,2.666667,2.666667
r1,2,5,8,11,14,7,2.625,2.625
r0,1,4,7,10,13,5,2.571429,2.571429


In [302]:
def f(x,y,z):
    return (x+y)/z
ndf['c7']=ndf.apply(lambda x: f(x.c4, x.c5, x.c2), axis=1)

ndf

Unnamed: 0,c0,c1,c2,c3,c4,c5,c7
r2,3,6,9,12,15,9,2.666667
r1,2,5,8,11,14,7,2.625
r0,1,4,7,10,13,5,2.571429
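Columns `c6` and `c7` hold the same numbers, computed in two ways. The sketch below reproduces both on a reduced copy of the data and checks that they agree. The row-wise `apply` calls a Python function once per row, whereas the vectorized expression operates on whole columns, which is generally the faster option.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c2": [9, 8, 7], "c4": [15, 14, 13], "c5": [9, 7, 5]})

def f(x, y, z):
    return (x + y) / z

# Row-wise: one Python function call per row
row_wise = df.apply(lambda row: f(row.c4, row.c5, row.c2), axis=1)

# Vectorized: arithmetic on entire columns at once
vectorized = (df.c4 + df.c5) / df.c2

print(np.allclose(row_wise, vectorized))
```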


In [308]:
ndf.sort_index(axis=1, ascending=True)

Unnamed: 0,c0,c1,c2,c3,c4,c5,c6,c7
r2,3,6,9,12,15,9,2.666667,2.666667
r1,2,5,8,11,14,7,2.625,2.625
r0,1,4,7,10,13,5,2.571429,2.571429


### Unit 2.3 Merging and Binding DataFrames

In [297]:
dfA = pd.DataFrame({
    'Name':'Harry Potter,David Baker,John Smith,Juan Martinez,Jane Connor'.split(','),
    'Gender':'Male Male Male Male Female'.split(),
    'Age':[23,31,22,36,30]
    
})
dfA

Unnamed: 0,Name,Gender,Age
0,Harry Potter,Male,23
1,David Baker,Male,31
2,John Smith,Male,22
3,Juan Martinez,Male,36
4,Jane Connor,Female,30


In [298]:
dfB = pd.DataFrame({
    'Name':'John Smith,Alex Du Bois,Joanne Rowling,Jane Connor'.split(','),
    'Position':'Intern,Team Lead,Manager,Manager'.split(','),
    'Wage':[25000,75000,90000,70000]
    
})
dfB

Unnamed: 0,Name,Position,Wage
0,John Smith,Intern,25000
1,Alex Du Bois,Team Lead,75000
2,Joanne Rowling,Manager,90000
3,Jane Connor,Manager,70000


In [309]:
# Left join: keep every row of the left DataFrame, matching on 'Name'
pd.merge(dfA,dfB,on='Name',how='left')

Unnamed: 0,Name,Gender,Age,Position,Wage
0,Harry Potter,Male,23,,
1,David Baker,Male,31,,
2,John Smith,Male,22,Intern,25000.0
3,Juan Martinez,Male,36,,
4,Jane Connor,Female,30,Manager,70000.0
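In the left join above, `Wage` appears as `25000.0` rather than `25000`. A minimal sketch of the reason, on two tiny hypothetical frames: rows without a match receive NaN, and since NaN is a float the integer column is upcast to `float64`.

```python
import pandas as pd

left = pd.DataFrame({"Name": ["Harry Potter", "John Smith"]})
right = pd.DataFrame({"Name": ["John Smith"], "Wage": [25000]})

joined = pd.merge(left, right, on="Name", how="left")

print(right["Wage"].dtype)   # integer before the merge
print(joined["Wage"].dtype)  # float after the merge, because of the NaN
print(joined["Wage"].isna().tolist())
```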


In [310]:
# Right join: keep every row of the right DataFrame
pd.merge(dfA,dfB,how='right')

Unnamed: 0,Name,Gender,Age,Position,Wage
0,John Smith,Male,22.0,Intern,25000
1,Alex Du Bois,,,Team Lead,75000
2,Joanne Rowling,,,Manager,90000
3,Jane Connor,Female,30.0,Manager,70000


In [311]:
pd.merge(dfA,dfB,how='outer')

Unnamed: 0,Name,Gender,Age,Position,Wage
0,Alex Du Bois,,,Team Lead,75000.0
1,David Baker,Male,31.0,,
2,Harry Potter,Male,23.0,,
3,Jane Connor,Female,30.0,Manager,70000.0
4,Joanne Rowling,,,Manager,90000.0
5,John Smith,Male,22.0,Intern,25000.0
6,Juan Martinez,Male,36.0,,


In [312]:
pd.merge(dfA,dfB,how='inner')

Unnamed: 0,Name,Gender,Age,Position,Wage
0,John Smith,Male,22,Intern,25000
1,Jane Connor,Female,30,Manager,70000


In [315]:
# Here `on` is redundant because 'Name' is the only column the two DataFrames share
pd.merge(dfA,dfB,how='inner',on='Name')

Unnamed: 0,Name,Gender,Age,Position,Wage
0,John Smith,Male,22,Intern,25000
1,Jane Connor,Female,30,Manager,70000
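To see which side each row of a merge comes from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below applies it to an outer join over two shortened, hypothetical versions of `dfA` and `dfB`.

```python
import pandas as pd

dfA = pd.DataFrame({"Name": ["John Smith", "Harry Potter"], "Age": [22, 23]})
dfB = pd.DataFrame({"Name": ["John Smith", "Alex Du Bois"], "Wage": [25000, 75000]})

# _merge is "both", "left_only" or "right_only" for each row
merged = pd.merge(dfA, dfB, on="Name", how="outer", indicator=True)
print(merged[["Name", "_merge"]])
```

Filtering on `_merge == "both"` gives the same rows as an inner join, which can be a handy check when comparing the four `how` options.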
