# Pandas 

reference: pandas.pydata.org

```python
import pandas as pd
```

Workflow: input data -> ETL -> output data + analysis

In [1]:
import pandas as pd

In [2]:
pd.__version__

'1.5.0'

## Input Data

Pandas can read a wide variety of file formats:
- csv
- txt
- hdf
- excel
- sql
- parquet
- otros

```python
# general pattern for reading data: one reader function per format
# pd.read_<format>(...), e.g. pd.read_csv, pd.read_excel, pd.read_parquet
```
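
As a minimal sketch of the `pd.read_<format>` pattern, the snippet below reads the same small table from CSV text and from JSON text. The data and variable names are made up for illustration, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical in-memory "files"; real code would usually pass a path instead.
csv_text = "col1,col2\n1,2\n3,4\n"
json_text = '[{"col1": 1, "col2": 2}, {"col1": 3, "col2": 4}]'

df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

# Each reader returns a DataFrame, whatever the source format was.
print(type(df_csv).__name__, df_csv.shape)
print(type(df_json).__name__, df_json.shape)
```

Whichever reader is used, the rest of the workflow operates on the resulting DataFrame in the same way.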

### read_csv()

In [3]:
pd.read_csv("example1.csv")

Unnamed: 0,col1,col2,col3
0,1,2,3
1,4,5,6
2,7,8,9
3,10,11,12


In [4]:
df = pd.read_csv("example1.csv")

In [5]:
type(df)

pandas.core.frame.DataFrame

In [6]:
df

Unnamed: 0,col1,col2,col3
0,1,2,3
1,4,5,6
2,7,8,9
3,10,11,12


**Pandas DataFrame**

- 2 dimensions: columns and rows
- It has an index for the rows and another one for the columns
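
To make the two axes concrete, here is a small sketch that builds a DataFrame by hand (the values are illustrative, not read from any of the example files) and inspects its shape and both indexes.

```python
import pandas as pd

# Illustrative data; the column names are arbitrary.
df = pd.DataFrame(
    {"col1": [1, 4, 7, 10], "col2": [2, 5, 8, 11], "col3": [3, 6, 9, 12]}
)

print(df.shape)          # (number of rows, number of columns)
print(list(df.index))    # row labels
print(list(df.columns))  # column labels
```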

In [7]:
# shape (rows, cols)
df.shape

(4, 3)

In [8]:
print(f"This DataFrame has {df.shape[0]} rows and {df.shape[1]} columns")

This DataFrame has 4 rows and 3 columns


In [9]:
df.index

RangeIndex(start=0, stop=4, step=1)

**example2: sep = ";"**

In [11]:
df = pd.read_csv("example2.csv")
df

Unnamed: 0,col1; col2; col3
0,1;2;3
1,4;5;6


In [12]:
df.shape

(2, 1)

In [13]:
df = pd.read_csv("example2.csv",sep=";")
df

Unnamed: 0,col1,col2,col3
0,1,2,3
1,4,5,6


In [14]:
df.shape

(2, 3)
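
The effect of `sep` can be reproduced without any file on disk. The sketch below assumes semicolon-separated content similar to `example2.csv` (the exact file content is a guess) and reads it twice, once with the default separator and once with `sep=";"`.

```python
import io

import pandas as pd

raw = "col1;col2;col3\n1;2;3\n4;5;6\n"  # hypothetical file content

wrong = pd.read_csv(io.StringIO(raw))           # default sep=","
right = pd.read_csv(io.StringIO(raw), sep=";")  # matches the real delimiter

print(wrong.shape)  # a single column: each line is read as one field
print(right.shape)  # three columns
```

A DataFrame with only one column after `read_csv` is often a hint that the delimiter does not match the file.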

**example3: other arguments**

In [15]:
df = pd.read_csv("example3.csv", sep=";")
print(df.shape)
df

(3, 3)


Unnamed: 0,col1,col2,col3
0,Roberto,38,Madrid
1,Joaquin,35,Valencia
2,Albert,60,Barcelona


In [17]:
# pass custom column names + skiprows to skip the original header line
df = pd.read_csv(
    "example3.csv", 
    sep=";", 
    names=["name","age","city"],
    skiprows=1,
)

df

Unnamed: 0,name,age,city
0,Roberto,38,Madrid
1,Joaquin,35,Valencia
2,Albert,60,Barcelona


In [18]:
# read only a subset of the columns with usecols
df = pd.read_csv(
    "example3.csv", 
    sep=";", 
    usecols=["col2","col3"],
)

df


Unnamed: 0,col2,col3
0,38,Madrid
1,35,Valencia
2,60,Barcelona


In [20]:
# set a column as the index of the DataFrame
df = pd.read_csv(
    "example4.csv", 
    sep=";",
    index_col="col4"
)

df

col4,col1,col2,col3
A,Roberto,38,Madrid
B,Joaquin,35,Valencia
C,Albert,60,Barcelona


In [21]:
df.index

Index(['A', 'B', 'C'], dtype='object', name='col4')
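
These arguments can also be combined in one call. The following sketch uses in-memory text modelled on the example files (the content and the new column names are assumptions) to apply `skiprows`, `names`, `usecols` and `index_col` together.

```python
import io

import pandas as pd

# Hypothetical content, modelled on the example files used above.
raw = "col1;col2;col3;col4\nRoberto;38;Madrid;A\nJoaquin;35;Valencia;B\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    skiprows=1,                             # skip the original header line
    names=["name", "age", "city", "code"],  # supply our own column names
    usecols=["name", "age", "code"],        # keep only these columns
    index_col="code",                       # use "code" as the row index
)

print(df)
```

Note that `usecols` and `index_col` refer to the names given in `names`, and the index column must be among the columns kept by `usecols`.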

## Exploring the properties of a pandas DataFrame

In [22]:
titanic_df = pd.read_csv("titanic.csv")
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [23]:
titanic_df.shape

(891, 12)

In [24]:
# column names
titanic_df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [26]:
"country" in titanic_df.columns

False

In [27]:
"Ticket" in titanic_df.columns

True

In [28]:
print(list(titanic_df.columns))

['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']


In [29]:
# head(n) -> first 5 rows by default
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [33]:
titanic_df.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


In [34]:
# tail(n) -> last 5 rows by default
titanic_df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [35]:
titanic_df.tail(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [36]:
titanic_tail_df = titanic_df.tail(3)
print(titanic_tail_df.shape)
titanic_tail_df

(3, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [37]:
# data types
titanic_df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [38]:
# info()
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [39]:
# describe()
titanic_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
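
The output above contains only the numeric columns, which is the default behaviour of `describe()`. As a small sketch on made-up data, `include="all"` adds the non-numeric columns, which get `unique`, `top` and `freq` instead of mean and quantiles.

```python
import pandas as pd

# Illustrative data with one text column and one numeric column.
df = pd.DataFrame(
    {"sex": ["male", "female", "female"], "age": [22.0, 38.0, None]}
)

numeric_summary = df.describe()           # numeric columns only
all_summary = df.describe(include="all")  # numeric and non-numeric columns

print(numeric_summary)
print(all_summary)
```

`count` excludes missing values, which is why it can be lower than the number of rows.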


### Selecting columns from a DataFrame

In [40]:
titanic_df["Survived"]

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

In [41]:
type(titanic_df["Survived"])

pandas.core.series.Series

In [42]:
titanic_df["Age"]

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [43]:
titanic_df.Age

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [46]:
# selecting multiple columns
titanic_df[["Survived","Age","Fare"]]

Unnamed: 0,Survived,Age,Fare
0,0,22.0,7.2500
1,1,38.0,71.2833
2,1,26.0,7.9250
3,1,35.0,53.1000
4,0,35.0,8.0500
...,...,...,...
886,0,27.0,13.0000
887,1,19.0,30.0000
888,0,,23.4500
889,1,26.0,30.0000


In [45]:
type(titanic_df[["Survived","Age"]])

pandas.core.frame.DataFrame
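
The difference between single and double brackets is easy to miss, so here is a short sketch on illustrative data: a single label returns a Series, while a list of labels returns a DataFrame even when the list has only one element.

```python
import pandas as pd

df = pd.DataFrame({"Survived": [0, 1, 1], "Age": [22.0, 38.0, 26.0]})

as_series = df["Age"]   # single label -> Series
as_frame = df[["Age"]]  # list of labels -> DataFrame, even with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

Attribute access such as `df.Age` only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so bracket access is the safer default.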

## Subsetting a DF (loc and iloc)

- loc -> label-based indexing
- iloc -> integer (position) based indexing

In [51]:
titanic_df = pd.read_csv("titanic.csv", index_col="Name")
titanic_df

Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"Braund, Mr. Owen Harris",1,0,3,male,22.0,1,0,A/5 21171,7.2500,,S
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
"Heikkinen, Miss. Laina",3,1,3,female,26.0,0,0,STON/O2. 3101282,7.9250,,S
"Futrelle, Mrs. Jacques Heath (Lily May Peel)",4,1,1,female,35.0,1,0,113803,53.1000,C123,S
"Allen, Mr. William Henry",5,0,3,male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
"Montvila, Rev. Juozas",887,0,2,male,27.0,0,0,211536,13.0000,,S
"Graham, Miss. Margaret Edith",888,1,1,female,19.0,0,0,112053,30.0000,B42,S
"Johnston, Miss. Catherine Helen ""Carrie""",889,0,3,female,,1,2,W./C. 6607,23.4500,,S
"Behr, Mr. Karl Howell",890,1,1,male,26.0,0,0,111369,30.0000,C148,C


In [52]:
titanic_df.index

Index(['Braund, Mr. Owen Harris',
       'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
       'Heikkinen, Miss. Laina',
       'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
       'Allen, Mr. William Henry', 'Moran, Mr. James',
       'McCarthy, Mr. Timothy J', 'Palsson, Master. Gosta Leonard',
       'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)',
       'Nasser, Mrs. Nicholas (Adele Achem)',
       ...
       'Markun, Mr. Johann', 'Dahlberg, Miss. Gerda Ulrika',
       'Banfield, Mr. Frederick James', 'Sutehall, Mr. Henry Jr',
       'Rice, Mrs. William (Margaret Norton)', 'Montvila, Rev. Juozas',
       'Graham, Miss. Margaret Edith',
       'Johnston, Miss. Catherine Helen "Carrie"', 'Behr, Mr. Karl Howell',
       'Dooley, Mr. Patrick'],
      dtype='object', name='Name', length=891)

In [55]:
titanic_df

Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"Braund, Mr. Owen Harris",1,0,3,male,22.0,1,0,A/5 21171,7.2500,,S
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
"Heikkinen, Miss. Laina",3,1,3,female,26.0,0,0,STON/O2. 3101282,7.9250,,S
"Futrelle, Mrs. Jacques Heath (Lily May Peel)",4,1,1,female,35.0,1,0,113803,53.1000,C123,S
"Allen, Mr. William Henry",5,0,3,male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
"Montvila, Rev. Juozas",887,0,2,male,27.0,0,0,211536,13.0000,,S
"Graham, Miss. Margaret Edith",888,1,1,female,19.0,0,0,112053,30.0000,B42,S
"Johnston, Miss. Catherine Helen ""Carrie""",889,0,3,female,,1,2,W./C. 6607,23.4500,,S
"Behr, Mr. Karl Howell",890,1,1,male,26.0,0,0,111369,30.0000,C148,C


In [56]:
# .loc[] -> label index 
titanic_df.loc['Heikkinen, Miss. Laina']

PassengerId                   3
Survived                      1
Pclass                        3
Sex                      female
Age                        26.0
SibSp                         0
Parch                         0
Ticket         STON/O2. 3101282
Fare                      7.925
Cabin                       NaN
Embarked                      S
Name: Heikkinen, Miss. Laina, dtype: object

In [57]:
first_row = titanic_df.loc["Braund, Mr. Owen Harris"]
first_row

PassengerId            1
Survived               0
Pclass                 3
Sex                 male
Age                 22.0
SibSp                  1
Parch                  0
Ticket         A/5 21171
Fare                7.25
Cabin                NaN
Embarked               S
Name: Braund, Mr. Owen Harris, dtype: object

In [59]:
first_row["Age"]

22.0

In [60]:
titanic_df.iloc[0]

PassengerId            1
Survived               0
Pclass                 3
Sex                 male
Age                 22.0
SibSp                  1
Parch                  0
Ticket         A/5 21171
Fare                7.25
Cabin                NaN
Embarked               S
Name: Braund, Mr. Owen Harris, dtype: object

In [61]:
titanic_df.iloc[-1]

PassengerId       891
Survived            0
Pclass              3
Sex              male
Age              32.0
SibSp               0
Parch               0
Ticket         370376
Fare             7.75
Cabin             NaN
Embarked            Q
Name: Dooley, Mr. Patrick, dtype: object

In [62]:
titanic_df.tail(1)

Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"Dooley, Mr. Patrick",891,0,3,male,32.0,0,0,370376,7.75,,Q


In [63]:
selected_names = [
    "Dooley, Mr. Patrick",
    "Braund, Mr. Owen Harris",
]

sample_df = titanic_df.loc[selected_names]
sample_df

Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"Dooley, Mr. Patrick",891,0,3,male,32.0,0,0,370376,7.75,,Q
"Braund, Mr. Owen Harris",1,0,3,male,22.0,1,0,A/5 21171,7.25,,S


In [65]:
selected_names = [
    "Dooley, Mr. Patrick",
    "Braund, Mr. Owen Harris",
]

titanic_df.loc[selected_names , ["PassengerId","Survived","Age"]]

Name,PassengerId,Survived,Age
"Dooley, Mr. Patrick",891,0,32.0
"Braund, Mr. Owen Harris",1,0,22.0


In [76]:
titanic_df.iloc[:3,-3:]

Name,Fare,Cabin,Embarked
"Braund, Mr. Owen Harris",7.25,,S
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",71.2833,C85,C
"Heikkinen, Miss. Laina",7.925,,S

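One detail the examples above do not show is how slices behave. In this sketch (shortened labels and made-up values), a label slice with `.loc` includes its end point, while a position slice with `.iloc` excludes it, as in regular Python slicing.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [22.0, 38.0, 26.0, 35.0], "fare": [7.25, 71.28, 7.92, 53.10]},
    index=["Braund", "Cumings", "Heikkinen", "Futrelle"],  # shortened labels
)

by_label = df.loc["Braund":"Heikkinen", "age"]  # label slice: end included
by_position = df.iloc[0:2, 0]                   # position slice: end excluded

print(len(by_label), len(by_position))

# The same cell addressed by label and by position
print(df.loc["Cumings", "fare"], df.iloc[1, 1])
```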

In [77]:
# rename columns
titanic_df

Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"Braund, Mr. Owen Harris",1,0,3,male,22.0,1,0,A/5 21171,7.2500,,S
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
"Heikkinen, Miss. Laina",3,1,3,female,26.0,0,0,STON/O2. 3101282,7.9250,,S
"Futrelle, Mrs. Jacques Heath (Lily May Peel)",4,1,1,female,35.0,1,0,113803,53.1000,C123,S
"Allen, Mr. William Henry",5,0,3,male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
"Montvila, Rev. Juozas",887,0,2,male,27.0,0,0,211536,13.0000,,S
"Graham, Miss. Margaret Edith",888,1,1,female,19.0,0,0,112053,30.0000,B42,S
"Johnston, Miss. Catherine Helen ""Carrie""",889,0,3,female,,1,2,W./C. 6607,23.4500,,S
"Behr, Mr. Karl Howell",890,1,1,male,26.0,0,0,111369,30.0000,C148,C


In [78]:
titanic_df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [79]:
# normalise column names: strip whitespace and convert to lowercase
new_columns = [str(col).strip().lower() for col in titanic_df.columns]
new_columns

['passengerid',
 'survived',
 'pclass',
 'sex',
 'age',
 'sibsp',
 'parch',
 'ticket',
 'fare',
 'cabin',
 'embarked']

In [80]:
titanic_df.columns = new_columns
titanic_df

Name,passengerid,survived,pclass,sex,age,sibsp,parch,ticket,fare,cabin,embarked
"Braund, Mr. Owen Harris",1,0,3,male,22.0,1,0,A/5 21171,7.2500,,S
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
"Heikkinen, Miss. Laina",3,1,3,female,26.0,0,0,STON/O2. 3101282,7.9250,,S
"Futrelle, Mrs. Jacques Heath (Lily May Peel)",4,1,1,female,35.0,1,0,113803,53.1000,C123,S
"Allen, Mr. William Henry",5,0,3,male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
"Montvila, Rev. Juozas",887,0,2,male,27.0,0,0,211536,13.0000,,S
"Graham, Miss. Margaret Edith",888,1,1,female,19.0,0,0,112053,30.0000,B42,S
"Johnston, Miss. Catherine Helen ""Carrie""",889,0,3,female,,1,2,W./C. 6607,23.4500,,S
"Behr, Mr. Karl Howell",890,1,1,male,26.0,0,0,111369,30.0000,C148,C

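As an alternative sketch to building the list by hand, the column Index exposes vectorised string methods through `.str`, and `rename()` changes individual names. The column names below are invented to show the whitespace stripping.

```python
import pandas as pd

# Illustrative column names with stray whitespace and mixed case.
df = pd.DataFrame({" PassengerId": [1, 2], "Age ": [22.0, 38.0]})

# Vectorised string methods on the column Index
cleaned = df.copy()
cleaned.columns = cleaned.columns.str.strip().str.lower()

# rename() returns a new DataFrame and accepts a mapping of old -> new names
renamed = cleaned.rename(columns={"passengerid": "passenger_id"})

print(list(cleaned.columns))
print(list(renamed.columns))
```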

## Filtering rows on conditions

* filter
* mask + loc
* where
* query

In [81]:
sample_df = titanic_df.sample(5)
sample_df

Name,passengerid,survived,pclass,sex,age,sibsp,parch,ticket,fare,cabin,embarked
"Smart, Mr. John Montgomery",468,0,1,male,56.0,0,0,113792,26.55,,S
"Sjoblom, Miss. Anna Sofia",787,1,3,female,18.0,0,0,3101265,7.4958,,S
"Horgan, Mr. John",614,0,3,male,,0,0,370377,7.75,,Q
"Rothschild, Mrs. Martin (Elizabeth L. Barrett)",514,1,1,female,54.0,1,0,PC 17603,59.4,,C
"Mellinger, Miss. Madeleine Violet",447,1,2,female,13.0,0,1,250644,19.5,,S


In [82]:
sample_df["sex"] == "female"

Name
Smart, Mr. John Montgomery                        False
Sjoblom, Miss. Anna Sofia                          True
Horgan, Mr. John                                  False
Rothschild, Mrs. Martin (Elizabeth L. Barrett)     True
Mellinger, Miss. Madeleine Violet                  True
Name: sex, dtype: bool

In [83]:
filter_sex = sample_df["sex"] == "female"
sample_df[filter_sex]

Name,passengerid,survived,pclass,sex,age,sibsp,parch,ticket,fare,cabin,embarked
"Sjoblom, Miss. Anna Sofia",787,1,3,female,18.0,0,0,3101265,7.4958,,S
"Rothschild, Mrs. Martin (Elizabeth L. Barrett)",514,1,1,female,54.0,1,0,PC 17603,59.4,,C
"Mellinger, Miss. Madeleine Violet",447,1,2,female,13.0,0,1,250644,19.5,,S


In [84]:
sample_df[sample_df["sex"] == "female"]

Name,passengerid,survived,pclass,sex,age,sibsp,parch,ticket,fare,cabin,embarked
"Sjoblom, Miss. Anna Sofia",787,1,3,female,18.0,0,0,3101265,7.4958,,S
"Rothschild, Mrs. Martin (Elizabeth L. Barrett)",514,1,1,female,54.0,1,0,PC 17603,59.4,,C
"Mellinger, Miss. Madeleine Violet",447,1,2,female,13.0,0,1,250644,19.5,,S


In [85]:
# not female -> ~
sample_df[~(sample_df["sex"] == "female")]

Name,passengerid,survived,pclass,sex,age,sibsp,parch,ticket,fare,cabin,embarked
"Smart, Mr. John Montgomery",468,0,1,male,56.0,0,0,113792,26.55,,S
"Horgan, Mr. John",614,0,3,male,,0,0,370377,7.75,,Q


In [86]:
# best practices -> .loc[filter]
filter_sex = sample_df["sex"] == "female"
sample_df.loc[filter_sex,:]

Name,passengerid,survived,pclass,sex,age,sibsp,parch,ticket,fare,cabin,embarked
"Sjoblom, Miss. Anna Sofia",787,1,3,female,18.0,0,0,3101265,7.4958,,S
"Rothschild, Mrs. Martin (Elizabeth L. Barrett)",514,1,1,female,54.0,1,0,PC 17603,59.4,,C
"Mellinger, Miss. Madeleine Violet",447,1,2,female,13.0,0,1,250644,19.5,,S


In [87]:
sample_df.loc[filter_sex,["survived","ticket"]]

Name,survived,ticket
"Sjoblom, Miss. Anna Sofia",1,3101265
"Rothschild, Mrs. Martin (Elizabeth L. Barrett)",1,PC 17603
"Mellinger, Miss. Madeleine Violet",1,250644


In [88]:
# query
sample_df.query("sex == 'female'")

Name,passengerid,survived,pclass,sex,age,sibsp,parch,ticket,fare,cabin,embarked
"Sjoblom, Miss. Anna Sofia",787,1,3,female,18.0,0,0,3101265,7.4958,,S
"Rothschild, Mrs. Martin (Elizabeth L. Barrett)",514,1,1,female,54.0,1,0,PC 17603,59.4,,C
"Mellinger, Miss. Madeleine Violet",447,1,2,female,13.0,0,1,250644,19.5,,S


In [89]:
# multiple filters: & (and) and | (or)
filter_sex = sample_df["sex"] == "female"
filter_age = sample_df["age"] <= 20

sample_df[(filter_sex)&(filter_age)]

Name,passengerid,survived,pclass,sex,age,sibsp,parch,ticket,fare,cabin,embarked
"Sjoblom, Miss. Anna Sofia",787,1,3,female,18.0,0,0,3101265,7.4958,,S
"Mellinger, Miss. Madeleine Violet",447,1,2,female,13.0,0,1,250644,19.5,,S


In [90]:
# or
sample_df[(filter_sex)|(filter_age)]

Name,passengerid,survived,pclass,sex,age,sibsp,parch,ticket,fare,cabin,embarked
"Sjoblom, Miss. Anna Sofia",787,1,3,female,18.0,0,0,3101265,7.4958,,S
"Rothschild, Mrs. Martin (Elizabeth L. Barrett)",514,1,1,female,54.0,1,0,PC 17603,59.4,,C
"Mellinger, Miss. Madeleine Violet",447,1,2,female,13.0,0,1,250644,19.5,,S


In [92]:
# error: Python's `and`/`or` do not work element-wise on a Series; use & and | instead
sample_df[(filter_sex)or(filter_age)]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
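
The list at the start of this section also mentions `where`, which is not demonstrated above. The sketch below, on made-up data with shortened labels, compares a combined mask with `.loc`, the equivalent `query()` expression, and `Series.where`, which keeps the original length and replaces non-matching entries with NaN.

```python
import pandas as pd

df = pd.DataFrame(
    {"sex": ["male", "female", "female"], "age": [56.0, 18.0, 54.0]},
    index=["Smart", "Sjoblom", "Rothschild"],  # shortened labels
)

# Each comparison needs its own parentheses because & binds tighter than ==
mask = (df["sex"] == "female") & (df["age"] <= 20)

filtered = df.loc[mask]                              # keeps only matching rows
queried = df.query("sex == 'female' and age <= 20")  # same rows via query()
ages = df["age"].where(mask)                         # same length, NaN elsewhere

print(filtered.index.tolist())
print(ages.tolist())
```

Inside a `query()` string the keywords `and`/`or` are accepted, unlike in plain Python expressions on Series.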

## sort_values(), drop_duplicates(), null values

In [93]:
# sort DF
sample_df 

Name,passengerid,survived,pclass,sex,age,sibsp,parch,ticket,fare,cabin,embarked
"Smart, Mr. John Montgomery",468,0,1,male,56.0,0,0,113792,26.55,,S
"Sjoblom, Miss. Anna Sofia",787,1,3,female,18.0,0,0,3101265,7.4958,,S
"Horgan, Mr. John",614,0,3,male,,0,0,370377,7.75,,Q
"Rothschild, Mrs. Martin (Elizabeth L. Barrett)",514,1,1,female,54.0,1,0,PC 17603,59.4,,C
"Mellinger, Miss. Madeleine Violet",447,1,2,female,13.0,0,1,250644,19.5,,S


In [94]:
# sort by index
sample_df.sort_index()

Name,passengerid,survived,pclass,sex,age,sibsp,parch,ticket,fare,cabin,embarked
"Horgan, Mr. John",614,0,3,male,,0,0,370377,7.75,,Q
"Mellinger, Miss. Madeleine Violet",447,1,2,female,13.0,0,1,250644,19.5,,S
"Rothschild, Mrs. Martin (Elizabeth L. Barrett)",514,1,1,female,54.0,1,0,PC 17603,59.4,,C
"Sjoblom, Miss. Anna Sofia",787,1,3,female,18.0,0,0,3101265,7.4958,,S
"Smart, Mr. John Montgomery",468,0,1,male,56.0,0,0,113792,26.55,,S


In [95]:
# sort by index, descending order
sample_df.sort_index(ascending=False)

Name,passengerid,survived,pclass,sex,age,sibsp,parch,ticket,fare,cabin,embarked
"Smart, Mr. John Montgomery",468,0,1,male,56.0,0,0,113792,26.55,,S
"Sjoblom, Miss. Anna Sofia",787,1,3,female,18.0,0,0,3101265,7.4958,,S
"Rothschild, Mrs. Martin (Elizabeth L. Barrett)",514,1,1,female,54.0,1,0,PC 17603,59.4,,C
"Mellinger, Miss. Madeleine Violet",447,1,2,female,13.0,0,1,250644,19.5,,S
"Horgan, Mr. John",614,0,3,male,,0,0,370377,7.75,,Q


In [96]:
sample_df.sort_values("fare",ascending=False)

Name,passengerid,survived,pclass,sex,age,sibsp,parch,ticket,fare,cabin,embarked
"Rothschild, Mrs. Martin (Elizabeth L. Barrett)",514,1,1,female,54.0,1,0,PC 17603,59.4,,C
"Smart, Mr. John Montgomery",468,0,1,male,56.0,0,0,113792,26.55,,S
"Mellinger, Miss. Madeleine Violet",447,1,2,female,13.0,0,1,250644,19.5,,S
"Horgan, Mr. John",614,0,3,male,,0,0,370377,7.75,,Q
"Sjoblom, Miss. Anna Sofia",787,1,3,female,18.0,0,0,3101265,7.4958,,S


In [97]:
sample_df.sort_values(["pclass","fare"],ascending=[True,False])

Name,passengerid,survived,pclass,sex,age,sibsp,parch,ticket,fare,cabin,embarked
"Rothschild, Mrs. Martin (Elizabeth L. Barrett)",514,1,1,female,54.0,1,0,PC 17603,59.4,,C
"Smart, Mr. John Montgomery",468,0,1,male,56.0,0,0,113792,26.55,,S
"Mellinger, Miss. Madeleine Violet",447,1,2,female,13.0,0,1,250644,19.5,,S
"Horgan, Mr. John",614,0,3,male,,0,0,370377,7.75,,Q
"Sjoblom, Miss. Anna Sofia",787,1,3,female,18.0,0,0,3101265,7.4958,,S

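Sorting a column that contains missing values raises the question of where they end up. This sketch uses illustrative data with shortened labels: by default `sort_values()` places NaN last in either direction, and `na_position="first"` moves them to the top. The original DataFrame is not modified.

```python
import pandas as pd

df = pd.DataFrame(
    {"pclass": [1, 3, 3, 1], "age": [56.0, 18.0, None, 54.0]},
    index=["Smart", "Sjoblom", "Horgan", "Rothschild"],  # shortened labels
)

default_order = df.sort_values("age")                   # NaN goes last
descending = df.sort_values("age", ascending=False)     # NaN still last
nan_first = df.sort_values("age", na_position="first")  # NaN goes first

print(default_order.index.tolist())
print(descending.index.tolist())
print(nan_first.index.tolist())
```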

**duplicates**

In [98]:
sample_df.drop_duplicates()

Name,passengerid,survived,pclass,sex,age,sibsp,parch,ticket,fare,cabin,embarked
"Smart, Mr. John Montgomery",468,0,1,male,56.0,0,0,113792,26.55,,S
"Sjoblom, Miss. Anna Sofia",787,1,3,female,18.0,0,0,3101265,7.4958,,S
"Horgan, Mr. John",614,0,3,male,,0,0,370377,7.75,,Q
"Rothschild, Mrs. Martin (Elizabeth L. Barrett)",514,1,1,female,54.0,1,0,PC 17603,59.4,,C
"Mellinger, Miss. Madeleine Violet",447,1,2,female,13.0,0,1,250644,19.5,,S


In [99]:
df = pd.DataFrame({
    "name":["Roberto","Roberto","Joaquin"],
    "age":[38,38,38]
})

df

Unnamed: 0,name,age
0,Roberto,38
1,Roberto,38
2,Joaquin,38


In [100]:
df.drop_duplicates()

Unnamed: 0,name,age
0,Roberto,38
2,Joaquin,38


In [101]:
df.drop_duplicates(keep='last')

Unnamed: 0,name,age
1,Roberto,38
2,Joaquin,38


In [104]:
df.drop_duplicates(subset=["age"],keep="last")

Unnamed: 0,name,age
2,Joaquin,38
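
Before dropping rows it can help to see which ones pandas considers duplicates. As a sketch on the same small table, `duplicated()` returns a boolean Series and takes the same `subset` and `keep` arguments as `drop_duplicates()`.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Roberto", "Roberto", "Joaquin"], "age": [38, 38, 38]}
)

dup_default = df.duplicated()               # marks repeats after the first occurrence
dup_all = df.duplicated(keep=False)         # marks every row that has a duplicate
dup_by_age = df.duplicated(subset=["age"])  # compares only the "age" column

print(dup_default.tolist())
print(dup_all.tolist())
print(dup_by_age.tolist())
```

The mask can be passed to `.loc` to inspect the affected rows, for example `df.loc[dup_all]`.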


**null values**

In [105]:
titanic_df = pd.read_csv("titanic.csv")
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [106]:
titanic_df.isnull()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


In [107]:
titanic_df["Cabin"].isnull()

0       True
1      False
2       True
3      False
4       True
       ...  
886     True
887    False
888     True
889    False
890     True
Name: Cabin, Length: 891, dtype: bool

In [108]:
titanic_df["Cabin"].isnull().sum()

687

In [109]:
titanic_df["Cabin"].isnull().mean()

0.7710437710437711

In [110]:
titanic_df.isnull().mean()

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64

In [111]:
titanic_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
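
Once the nulls are counted, a common next step is to drop or fill them. The sketch below is one possible approach on made-up data, not a recommendation for the Titanic dataset: `dropna()` removes rows and `fillna()` replaces missing values, here with the column mean.

```python
import pandas as pd

# Illustrative data with missing values in two columns.
df = pd.DataFrame(
    {
        "Age": [22.0, None, 26.0],
        "Cabin": [None, "C85", None],
        "Fare": [7.25, 71.28, 7.92],
    }
)

null_share = df.isnull().mean()                  # fraction of nulls per column
complete_rows = df.dropna()                      # drop rows with any null
age_known = df.dropna(subset=["Age"])            # drop rows where Age is null
age_filled = df["Age"].fillna(df["Age"].mean())  # replace nulls with the mean

print(null_share)
print(len(complete_rows), len(age_known))
print(age_filled.tolist())
```

Both methods return new objects and leave `df` unchanged.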