# Intro to Pandas
- [https://pandas.pydata.org/](https://pandas.pydata.org/)
- a fast, powerful, flexible and easy to use open source data analysis and manipulation tool built on Python

## What kind of data does pandas handle?

### pandas data table representation
![img](images/Pandas-Table.svg)
- to work with the pandas package, you must first import it
- to install or update pandas, use either conda or pip

```bash
conda install pandas
pip install pandas
```

In [1]:
import pandas as pd
import numpy as np

In [6]:
print(f'pandas version: {pd.__version__}')
print(f'numpy version: {np.__version__}')

pandas version: 2.2.0
numpy version: 1.26.4


## Series

- a Series is a 1-D labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.)
- the axis labels are collectively referred to as the **index**
- API to create Series:
```python
s = pd.Series(data, index=index)
```

- data can be:
    - NumPy's **ndarray**
    - Python dictionary
    - a scalar value (e.g. 5)
    - Python List
- index is a list of axis labels
    - the index can be thought of as a row id or sample id
- if data is an ndarray, index must be the same length as data
    - if no index is passed, default index will be created `[0, ..., len(data)-1]`
- each column in the DataFrame is a Series

In [3]:
s = pd.Series(np.random.randn(5))

In [4]:
s

0   -1.050908
1    1.145567
2   -1.122552
3   -0.350206
4    0.794316
dtype: float64

In [5]:
s.index

RangeIndex(start=0, stop=5, step=1)

In [6]:
s1 = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [7]:
s1

a   -0.542334
b    0.717278
c   -0.258132
d    0.307099
e   -0.719742
dtype: float64

In [8]:
s1.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [9]:
# from dict
d = {"b": 1, "a": 0, "c": 2}
s2 = pd.Series(d)

In [10]:
s2

b    1
a    0
c    2
dtype: int64

In [11]:
# scalar value is repeated to match the length of index
s3 = pd.Series(5.0, index=["a", "b", "c", "d", "e"])

In [12]:
s3

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

In [13]:
# creating Series from Python List
s4 = pd.Series([1, 3, 5, np.nan, 6, 8])

In [14]:
s4

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

### Series is ndarray-like
- a Series acts very similarly to an ndarray and is a valid argument to most NumPy functions
- slicing Series also slices the index

In [15]:
s1

a   -0.048182
b    0.713255
c    0.270073
d   -0.450289
e   -0.782588
dtype: float64

In [16]:
# positional access; plain s1[0] on a labeled index is deprecated in pandas 2.x
s1.iloc[0]

-0.04818225722193457

In [17]:
s1[3:]

d   -0.450289
e   -0.782588
dtype: float64

In [18]:
# slice using condition
s1[s1 > s1.median()]

b    0.713255
c    0.270073
dtype: float64

In [19]:
# select by integer positions
s1.iloc[[4, 3, 1]]

e   -0.782588
d   -0.450289
b    0.713255
dtype: float64

In [20]:
# calculate 2**n for each element n in the Series
# https://numpy.org/doc/stable/reference/generated/numpy.exp2.html
np.exp2(s3)

a    32.0
b    32.0
c    32.0
d    32.0
e    32.0
dtype: float64

In [21]:
s3.dtype

dtype('float64')

### extract data array from Series
- extract just the data as an array, without the index

In [22]:
s3.array

<PandasArray>
[5.0, 5.0, 5.0, 5.0, 5.0]
Length: 5, dtype: float64

### convert series to ndarray

In [23]:
ndarr = s3.to_numpy()

In [24]:
ndarr

array([5., 5., 5., 5., 5.])

In [25]:
type(ndarr)

numpy.ndarray

In [26]:
ndarr.size

5

In [27]:
ndarr.shape

(5,)

### Series is dict-like
- use index as the key to get the corresponding value

In [28]:
s1['a']

-0.04818225722193457

In [29]:
s3['e']

5.0

In [30]:
s3['e'] = 15.0

In [31]:
s3

a     5.0
b     5.0
c     5.0
d     5.0
e    15.0
dtype: float64

In [32]:
s3['g']

KeyError: 'g'

In [33]:
# use get with default value if key is missing
s3.get('g', np.nan)

nan

### Vectorized operations and label alignment with Series
- very similar to NumPy ndarray

In [34]:
s3

a     5.0
b     5.0
c     5.0
d     5.0
e    15.0
dtype: float64

In [35]:
s3+s3

a    10.0
b    10.0
c    10.0
d    10.0
e    30.0
dtype: float64

In [36]:
s3-s3

a    0.0
b    0.0
c    0.0
d    0.0
e    0.0
dtype: float64

In [37]:
s3*2

a    10.0
b    10.0
c    10.0
d    10.0
e    30.0
dtype: float64

In [38]:
s3/5

a    1.0
b    1.0
c    1.0
d    1.0
e    3.0
dtype: float64

In [39]:
# Series automatically aligns the data based on label
# if the label is not found in one Series or the other, the result will be marked as missing NaN
s3[1:] + s3[:-1]

a     NaN
b    10.0
c    10.0
d    10.0
e     NaN
dtype: float64
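
When the NaN results of label alignment are not wanted, one option is `Series.add` with `fill_value`. The sketch below rebuilds a Series with the same values as `s3` (hypothetical sample data, so it runs on its own) and compares the two behaviors.

```python
import numpy as np
import pandas as pd

# hypothetical sample data, mirroring s3 from the cells above
s = pd.Series([5.0, 5.0, 5.0, 5.0, 15.0], index=["a", "b", "c", "d", "e"])

left = s[1:]    # labels b, c, d, e
right = s[:-1]  # labels a, b, c, d

# the + operator leaves NaN wherever a label exists on only one side
aligned = left + right

# Series.add with fill_value uses 0 for the side that lacks the label
filled = left.add(right, fill_value=0)

print(aligned)
print(filled)
```

Whether 0 is a sensible stand-in depends on the data; for sums it usually is, for averages it usually is not.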

### Name attribute
- a Series can also have a **name** attribute

In [40]:
s4 = pd.Series(np.random.randn(5), name="Some Name")

In [41]:
s4.name

'Some Name'

In [42]:
s4

0    0.698834
1   -0.913744
2    0.496605
3    1.581900
4   -1.805908
Name: Some Name, dtype: float64

In [43]:
# Series.rename creates a new Series with new name
s5 = s4.rename('New Name')

In [44]:
s5

0    0.698834
1   -0.913744
2    0.496605
3    1.581900
4   -1.805908
Name: New Name, dtype: float64

## DataFrame
- data table in pandas is called DataFrame
- DataFrame is the primary data structure of pandas
- a Python dict can be used to create a DataFrame: the keys become the column headers and the lists of values become the columns
- each column of a DataFrame is a `Series`

In [45]:
aDict = {
    "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth"
    ],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"]
}

In [46]:
df = pd.DataFrame(aDict)

In [47]:
df

Unnamed: 0,Name,Age,Sex
0,"Braund, Mr. Owen Harris",22,male
1,"Allen, Mr. William Henry",35,male
2,"Bonnell, Miss. Elizabeth",58,female


In [48]:
df2 = pd.DataFrame(
{
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})

In [49]:
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


### spreadsheet data
- the above df can also be represented in spreadsheet software
![SpreadSheet](./images/01_table_spreadsheet.png)

In [50]:
df

Unnamed: 0,Name,Age,Sex
0,"Braund, Mr. Owen Harris",22,male
1,"Allen, Mr. William Henry",35,male
2,"Bonnell, Miss. Elizabeth",58,female


In [51]:
# just work with the data in column - Age
# use dictionary syntax
df["Age"]

0    22
1    35
2    58
Name: Age, dtype: int64

In [52]:
# access series/column as attribute
df.Age

0    22
1    35
2    58
Name: Age, dtype: int64

## DataFrame Complete Reference
- complete reference: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

## DataFrame utility methods and attributes to review data
- `df.columns` - return the column labels of the DataFrame
- `df.index` - return the index (row labels/ids) of the DataFrame df object
- `df.dtypes` - return Series with the data type of each column in the df object
- `df.values` - return a **NumPy** representation of the DataFrame df object
- `df.axes` - return a list representing the axes of the DataFrame, `[row labels]` and `[column labels]`
- `df.shape` - return a tuple representing the dimensionality of the DataFrame df object
- `df.size` - return an int representing the number of elements in the DataFrame df object
- `df.info()` - print a concise summary of a DataFrame df object
- `df.describe()` - generate descriptive statistics
- `df.head(n)` - display the first n rows in the DataFrame df object; default n=5
- `df.tail(n)` - display the last n rows in the DataFrame df object; default n=5

In [53]:
df.columns

Index(['Name', 'Age', 'Sex'], dtype='object')

In [54]:
df.index

RangeIndex(start=0, stop=3, step=1)

In [55]:
df.values

array([['Braund, Mr. Owen Harris', 22, 'male'],
       ['Allen, Mr. William Henry', 35, 'male'],
       ['Bonnell, Miss. Elizabeth', 58, 'female']], dtype=object)

In [56]:
df.dtypes

Name    object
Age      int64
Sex     object
dtype: object

In [57]:
df.axes

[RangeIndex(start=0, stop=3, step=1),
 Index(['Name', 'Age', 'Sex'], dtype='object')]

In [58]:
df.shape

(3, 3)

In [59]:
df.size

9

In [60]:
# generate descriptive statistics
df.describe()

Unnamed: 0,Age
count,3.0
mean,38.333333
std,18.230012
min,22.0
25%,28.5
50%,35.0
75%,46.5
max,58.0


In [61]:
# print first 2 rows
df.head(2)

Unnamed: 0,Name,Age,Sex
0,"Braund, Mr. Owen Harris",22,male
1,"Allen, Mr. William Henry",35,male


In [62]:
# get individual stats for each Series
df['Age'].max()

58

In [63]:
# print last 2 rows
df.tail(2)

Unnamed: 0,Name,Age,Sex
1,"Allen, Mr. William Henry",35,male
2,"Bonnell, Miss. Elizabeth",58,female


## Read and write tabular data

![](./images/02_io_readwrite1.svg)

- use the pandas `pd.read_*(fileName)` functions to read data from various formats
- Pandas raw data: [https://github.com/pandas-dev/pandas/tree/master/doc/data](https://github.com/pandas-dev/pandas/tree/master/doc/data)
- read_csv - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
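
As a sketch that runs without network access, the example below reads CSV text from an in-memory buffer and writes it back out. The CSV content is hypothetical sample data; `index_col` tells `read_csv` which column to use as the row index instead of creating a default one.

```python
import io
import pandas as pd

# hypothetical CSV text standing in for a file on disk or a URL
csv_text = """PassengerId,Name,Age
1,"Braund, Mr. Owen Harris",22
2,"Allen, Mr. William Henry",35
3,"Bonnell, Miss. Elizabeth",58
"""

# read_csv accepts a path, a URL, or any file-like object
people = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

# to_csv with no path returns the CSV text as a string
round_trip = pd.read_csv(io.StringIO(people.to_csv()), index_col="PassengerId")

print(people)
print(round_trip.equals(people))
```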

In [64]:
# read CSV file directly from the Internet
iris_df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/iris.data')

In [65]:
iris_df.head()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Name
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [66]:
iris_df.tail()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Name
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica


In [67]:
# technical summary of DataFrame
iris_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   SepalLength  150 non-null    float64
 1   SepalWidth   150 non-null    float64
 2   PetalLength  150 non-null    float64
 3   PetalWidth   150 non-null    float64
 4   Name         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [68]:
# statistical summary of iris dataset
iris_df.describe()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


### Titanic Dataset

- https://github.com/pandas-dev/pandas/blob/master/doc/data/titanic.csv
- https://www.openml.org/d/40945

#### Column name description

```
PassengerId: Unique id of each passenger.

Survived: 0 if the passenger did not survive, 1 if the passenger survived.

Pclass: Passenger class: 1 (first), 2 (second) or 3 (third).

Name: Name of the passenger.

Sex: Sex of the passenger.

Age: Age of the passenger in years.

SibSp: Number of siblings and spouses travelling with the passenger.

Parch: Number of parents and children travelling with the passenger.

Ticket: Ticket number of the passenger.

Fare: Fare paid by the passenger.

Cabin: Cabin number of the passenger.

Embarked: Port of embarkation: C (Cherbourg), Q (Queenstown), S (Southampton).
```

In [69]:
# let's read titanic.csv file as DataFrame
titanicDf = pd.read_csv('data/titanic.csv')

In [70]:
titanicDf
# notice the dataset already provides a PassengerId column that could serve as the index
# by default, read_csv adds its own 0-based index column (row id)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [15]:
# let's read the csv with PassengerId as index column (row_id)
titanicDf = pd.read_csv('data/titanic.csv', index_col="PassengerId")

In [16]:
# print first 8 rows
titanicDf.head(8)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


In [17]:
titanicDf.shape

(891, 11)

In [18]:
titanicDf.dtypes

Survived      int64
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object

## Sort table rows
- based on some column name
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html

```python
DataFrame.sort_values(by='columnName', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
```

- returns the sorted DataFrame (NOT an inplace sort by default)
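
A few of the parameters in the signature above can be sketched on a small, hypothetical table (the names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical mini table with one missing age
df = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cy", "Di"],
     "Pclass": [3, 1, 3, 1],
     "Age": [30.0, np.nan, 22.0, 41.0]}
)

# ascending accepts one flag per column: Pclass ascending, Age descending
by_class_then_age = df.sort_values(by=["Pclass", "Age"], ascending=[True, False])

# missing values go last by default; na_position="first" moves them to the top,
# and ignore_index=True renumbers the rows 0..n-1
nulls_first = df.sort_values(by="Age", na_position="first", ignore_index=True)

print(by_class_then_age)
print(nulls_first)
```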

In [19]:
sortedTitanicDf = titanicDf.sort_values(by='Age')

In [20]:
sortedTitanicDf.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
804,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C
756,1,2,"Hamalainen, Master. Viljo",male,0.67,1,1,250649,14.5,,S
645,1,3,"Baclini, Miss. Eugenie",female,0.75,2,1,2666,19.2583,,C
470,1,3,"Baclini, Miss. Helene Barbara",female,0.75,2,1,2666,19.2583,,C
79,1,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0,,S


In [21]:
# sorting by default returns sorted DF without sorting original DF in place
titanicDf.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [22]:
# sort using multiple columns and in descending order
titanicDf.sort_values(by=['Pclass', 'Age'], ascending=False).head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S
117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
281,0,3,"Duane, Mr. Frank",male,65.0,0,0,336439,7.75,,Q
484,1,3,"Turkula, Mrs. (Hedwig)",female,63.0,0,0,4134,9.5875,,S
327,0,3,"Nysveen, Mr. Johan Hansen",male,61.0,0,0,345364,6.2375,,S


### Write the DataFrame as Excel file

- install the openpyxl library from a terminal; installing it from within the notebook doesn't seem to work

```bash
conda activate ml
conda install -y openpyxl
```

In [23]:
! conda install -y openpyxl

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [24]:
sortedTitanicDf.to_excel('data/titanic_sorted_age.xlsx', sheet_name='passengers')

In [25]:
# technical summary of DataFrame
sortedTitanicDf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 804 to 889
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


## Select a subset of a DataFrame

### Select specific columns
![](./images/03_subset_columns.svg)

In [26]:
# select just the Age column as a Series
ages = titanicDf['Age']

In [27]:
type(ages)

pandas.core.series.Series

In [28]:
ages.shape

(891,)

In [29]:
# get age and sex columns
age_sex = titanicDf[['Age', 'Sex']]

In [30]:
type(age_sex)

pandas.core.frame.DataFrame

In [31]:
age_sex.head()

PassengerId,Age,Sex
1,22.0,male
2,38.0,female
3,26.0,female
4,35.0,female
5,35.0,male


In [32]:
age_sex.shape

(891, 2)

In [33]:
# boolean mask of passengers older than 35; True or False for each row based on the condition
titanicDf['Age'] > 35

PassengerId
1      False
2       True
3      False
4      False
5      False
       ...  
887    False
888    False
889    False
890    False
891    False
Name: Age, Length: 891, dtype: bool

In [34]:
# DF of passengers older than 35
# the boolean mask keeps only the rows where the condition is True
titanicDf[titanicDf['Age'] > 35]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.2750,,S
16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0000,,S
...,...,...,...,...,...,...,...,...,...,...,...
866,1,2,"Bystrom, Mrs. (Karolina)",female,42.0,0,0,236852,13.0000,,S
872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
874,0,3,"Vander Cruyssen, Mr. Victor",male,47.0,0,0,345765,9.0000,,S
880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C


## Select specific rows and columns
- create new DataFrame using the filtered rows and columns

- two ways:

### df.iloc selections
- use 0-based integer positions of rows and columns
```python
df.iloc[[row selection], [column selection]]
```
- row selection can be:
    - single row index values: `[100]`
    - integer list of row indices: `[0, 2, 10]`
    - slice of row indices: `[2:10]`
        
- column selection can be:
    - single column selection: `[10]`
    - integer list of col indices: `[0, 3, 5]`
    - slice of column indices: `[3:10]`
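
The selection forms listed above can be sketched on a small, hypothetical table. String row labels are used on purpose so that positions and labels cannot be confused.

```python
import pandas as pd

# hypothetical table with string row labels
df = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cy", "Di"],
     "Age": [30, 25, 22, 41],
     "Sex": ["female", "male", "male", "female"]},
    index=["r1", "r2", "r3", "r4"],
)

one_row = df.iloc[[0]]       # list with one position -> one-row DataFrame
some_rows = df.iloc[[0, 2]]  # rows at positions 0 and 2, all columns
block = df.iloc[1:3, 0:2]    # rows 1-2, columns 0-1 (the stop position is excluded)
cell = df.iloc[3, 1]         # two scalars -> a single value

print(block)
print(cell)
```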
    

### df.loc selection
- use row labels and column labels

```python
df.loc[[row selection], [column selection]]
```
- row selection:
    - single row label/index: `["john"]`
    - list of row labels: `["john", "sarah"]`
    - condition: `[data['age'] >= 35]`
- column selection:
    - single column name: `['Age']`
    - list of column names: `['Name', 'Age', 'Sex']`
    - slice of column names: `['Name':'Age']` (both endpoints are included)
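
A minimal sketch of the label-based forms listed above, again on a hypothetical table with string row labels:

```python
import pandas as pd

# hypothetical table with string row labels
df = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cy", "Di"],
     "Age": [30, 25, 22, 41],
     "Sex": ["female", "male", "male", "female"]},
    index=["r1", "r2", "r3", "r4"],
)

by_label = df.loc[["r1", "r3"]]                # list of row labels, all columns
by_mask = df.loc[df["Age"] >= 30, ["Name"]]    # boolean condition plus a column list
label_slice = df.loc["r2":"r3", "Name":"Age"]  # label slices include both endpoints

print(by_mask)
print(label_slice)
```

Unlike `iloc` slices, a `loc` slice such as `"r2":"r3"` returns the row labeled `r3` as well.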
    
    
### Select specific rows and all the columns
![](images/03_subset_rows.svg)

In [35]:
# Create new DataFrame based on the criteria
# similar to using where clause in SQL
passengers = titanicDf[titanicDf['Age']>35]

In [36]:
passengers.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S
16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0,,S


In [37]:
passengers.describe()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare
count,217.0,217.0,217.0,217.0,217.0,217.0
mean,0.382488,1.81106,46.979263,0.345622,0.465438,43.966821
std,0.487119,0.858653,9.188272,0.522983,1.075809,56.083306
min,0.0,1.0,36.0,0.0,0.0,0.0
25%,0.0,1.0,40.0,0.0,0.0,12.525
50%,0.0,2.0,45.0,0.0,0.0,26.3875
75%,1.0,3.0,52.0,1.0,0.0,55.9
max,1.0,3.0,80.0,2.0,6.0,512.3292


In [38]:
# select all passengers who survived - rows with Survived column = 1
survived = titanicDf[titanicDf['Survived'] == 1]

In [39]:
survived

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
...,...,...,...,...,...,...,...,...,...,...,...
876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.2250,,C
880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


In [40]:
# another example of selection
class_23 = titanicDf[titanicDf['Pclass'].isin([2, 3])]

In [41]:
class_23.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


In [42]:
# select data where age is known
age_no_na = titanicDf[titanicDf['Age'].notna()]

In [43]:
age_no_na.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [44]:
age_no_na.shape

(714, 11)
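
Several conditions can be combined into one mask. The sketch below uses a small, hypothetical table with Titanic-style column names; each condition needs its own parentheses, and the operators are `&`, `|` and `~` rather than Python's `and`, `or` and `not`.

```python
import numpy as np
import pandas as pd

# hypothetical rows using Titanic-style column names
df = pd.DataFrame(
    {"Age": [22.0, 38.0, np.nan, 54.0, 2.0],
     "Sex": ["male", "female", "male", "male", "female"],
     "Survived": [0, 1, 0, 0, 1]}
)

older_men = df[(df["Age"] > 35) & (df["Sex"] == "male")]          # both must hold
child_or_survivor = df[(df["Age"] < 13) | (df["Survived"] == 1)]  # either may hold
age_missing = df[~df["Age"].notna()]                              # negation of a mask

print(older_men)
print(child_or_survivor)
print(age_missing)
```

Comparisons against NaN evaluate to False, so the row with a missing age is dropped by both `Age` conditions.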

In [45]:
# select rows 10-25 and columns 3-5
titanicDf.iloc[9:25, 2:5]

PassengerId,Name,Sex,Age
10,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0
11,"Sandstrom, Miss. Marguerite Rut",female,4.0
12,"Bonnell, Miss. Elizabeth",female,58.0
13,"Saundercock, Mr. William Henry",male,20.0
14,"Andersson, Mr. Anders Johan",male,39.0
15,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0
16,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0
17,"Rice, Master. Eugene",male,2.0
18,"Williams, Mr. Charles Eugene",male,
19,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female,31.0


In [46]:
# select rows based on row_ids or PassengerId
titanicDf.loc[[1, 3]]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [47]:
# select rows based on row labels and columns based on column names
titanicDf.loc[[1, 3], ['Age', 'Name']]

PassengerId,Age,Name
1,22.0,"Braund, Mr. Owen Harris"
3,26.0,"Heikkinen, Miss. Laina"


In [48]:
adult_names = titanicDf.loc[titanicDf['Age']>=18, ['Name']]

In [49]:
adult_names.head()

PassengerId,Name
1,"Braund, Mr. Owen Harris"
2,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
3,"Heikkinen, Miss. Laina"
4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
5,"Allen, Mr. William Henry"


In [50]:
# select passengers names older than 35 years
# NOTE: loc selects by row and column labels, not by integer positions
adult_age_names = titanicDf.loc[titanicDf['Age'] > 35, ['Age', 'Name']]

In [51]:
adult_age_names.head()

PassengerId,Age,Name
2,38.0,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
7,54.0,"McCarthy, Mr. Timothy J"
12,58.0,"Bonnell, Miss. Elizabeth"
14,39.0,"Andersson, Mr. Anders Johan"
16,55.0,"Hewlett, Mrs. (Mary D Kingcome)"


In [52]:
# TODO: select Age and Name of all the minor passengers with age less than 18

## Updating selected fields with iloc and loc
- update first 3 rows' Name column to "anonymous"
- `iloc` uses 0-based indices for rows and columns

In [53]:
# Note: PassengerId is row index not part of column
titanicDf.iloc[0:3, 2] = 'anonymous'

In [54]:
titanicDf.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,anonymous,male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,anonymous,female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,anonymous,female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [55]:
# update the Name of all children (Age < 13) to anonymous
titanicDf.loc[titanicDf['Age'] < 13, ['Name']] = 'anonymous'

In [58]:
# let's select and print just the Name column
titanicDf.loc[titanicDf['Age'] < 13, ['Name']]

PassengerId,Name
8,anonymous
11,anonymous
17,anonymous
25,anonymous
44,anonymous
...,...
828,anonymous
832,anonymous
851,anonymous
853,anonymous


## Creating new columns derived from existing columns

![](./images/05_newcolumn_1.svg)
- similar to adding another Series to the DataFrame, with the new column name as the key (as in a dictionary)
- the calculation of the values is done **element-wise**
- remember broadcasting?
    - you don't need a loop to iterate over the rows
- syntax:

```python
df['new_column_name'] = pd.Series()
```
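
A minimal sketch of the syntax above on a hypothetical table (the column names and numbers are made up for illustration): the right-hand side is a Series computed from existing columns, with no explicit loop.

```python
import pandas as pd

# hypothetical table of measurement counts
df = pd.DataFrame({"count": [4863951, 142407, 1025], "locations": [66, 1, 1]})

# a scalar is broadcast to every row
df["count_thousands"] = df["count"] / 1000

# two columns are combined row by row, matched on the index
df["count_per_location"] = df["count"] / df["locations"]

print(df)
```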

## Open Air Quality Data
- OpenAQ Data - [https://openaq.org/#/](https://openaq.org/#/)
- http://dhhagan.github.io/py-openaq/tutorial/api.html#openaq-api
- https://py-openaq.readthedocs.io/en/latest/

```bash
pip install py-openaq
```

### Let's use air quality data provided by OpenAQ API

In [115]:
! pip install py-openaq

Collecting py-openaq
  Using cached py-openaq-1.1.0.tar.gz (7.9 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: py-openaq
  Building wheel for py-openaq (setup.py) ... done
  Created wheel for py-openaq: filename=py_openaq-1.1.0-py3-none-any.whl size=9037 sha256=32ed2edd58fdd7095eee7abfc206dfcdd011f341cb44c19985440065ef607113
  Stored in directory: /Users/rbasnet/Library/Caches/pip/wheels/b0/87/d2/6824c8ea805b5ed5de4993b7d728b490d98d6b51abc54ddaf1
Successfully built py-openaq
Installing collected packages: py-openaq
Successfully installed py-openaq-1.1.0


In [10]:
import openaq
import json

In [4]:
api = openaq.OpenAQ()

In [15]:
# get the cities data
# resp = api.cities(df=True, limit=10000) # doesn't work anymore!
data = api.cities(limit=1000) # just get raw data first and convert it into DataFrame

In [16]:
# Tuple of results; 'results' in the dict at index 1 holds the list of records we need!
data

(200,
 {'meta': {'name': 'openaq-api',
   'license': '',
   'website': '/',
   'page': 1,
   'limit': 1000,
   'found': 2475,
   'pages': 3},
  'results': [{'country': 'TW',
    'city': ' ',
    'count': 4863951,
    'locations': 66},
   {'country': 'JP', 'city': ' ', 'count': 24985118, 'locations': 1530},
   {'country': 'CN', 'city': ' ', 'count': 142407, 'locations': 1},
   {'country': 'US', 'city': '007', 'count': 20445, 'locations': 4},
   {'country': 'US', 'city': '015', 'count': 1025, 'locations': 1},
   {'country': 'US', 'city': '037', 'count': 13693, 'locations': 1},
   {'country': 'US', 'city': '039', 'count': 6662, 'locations': 2},
   {'country': 'US', 'city': '047', 'count': 21484, 'locations': 4},
   {'country': 'US', 'city': '057', 'count': 2721, 'locations': 1},
   {'country': 'US', 'city': '059', 'count': 3134, 'locations': 1},
   {'country': 'US', 'city': '069', 'count': 2227, 'locations': 1},
   {'country': 'IT', 'city': 'A2A', 'count': 8568, 'locations': 5},
   {'coun

In [18]:
pollutionDF = pd.DataFrame(data[1]['results'])

In [19]:
pollutionDF

Unnamed: 0,country,city,count,locations
0,TW,,4863951,66
1,JP,,24985118,1530
2,CN,,142407,1
3,US,007,20445,4
4,US,015,1025,1
...,...,...,...,...
995,PL,Jedlina-Zdrój,15771,1
996,US,JEFFERSON,200375,5
997,US,Jefferson City,33649,1
998,PL,Jelenia Góra,251296,1


In [21]:
pollutionDF.head()

Unnamed: 0,country,city,count,locations
0,TW,,4863951,66
1,JP,,24985118,1530
2,CN,,142407,1
3,US,007,20445,4
4,US,015,1025,1


In [22]:
# add a new column using existing column
pollutionDF['pollution_per100'] = pollutionDF['count']/100

In [23]:
# let's see the new column
pollutionDF.head()

Unnamed: 0,country,city,count,locations,pollution_per100
0,TW,,4863951,66,48639.51
1,JP,,24985118,1530,249851.18
2,CN,,142407,1,1424.07
3,US,007,20445,4,204.45
4,US,015,1025,1,10.25
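Column arithmetic is applied element-wise, so a derived column needs no explicit loop. A small sketch on a made-up two-row table (the `demo` frame and the `count_per_location` column are illustrative, not part of the dataset above); it also shows `assign`, which returns a new DataFrame instead of modifying the original:

```python
import pandas as pd

demo = pd.DataFrame({"count": [20445, 1025], "locations": [4, 1]})

# element-wise division of one column by another, stored in place
demo["count_per_location"] = demo["count"] / demo["locations"]

# assign() leaves `demo` untouched and returns a copy with the extra column
demo_with_per100 = demo.assign(count_per100=demo["count"] / 100)

print(demo_with_per100)
```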


In [24]:
# rename column headers; keys that match no existing column ('name' here) are silently ignored
renamedDF = pollutionDF.rename(
    columns={
        'country': 'C_Code',
        'name' : 'C_Name',
    }
)

In [25]:
# see all the column names
pollutionDF.columns

Index(['country', 'city', 'count', 'locations', 'pollution_per100'], dtype='object')

In [26]:
renamedDF.columns

Index(['C_Code', 'city', 'count', 'locations', 'pollution_per100'], dtype='object')
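Note that only `country` was renamed: the mapping also contained `'name'`, which is not a column, and `rename` skips unknown keys by default. A short sketch on a toy frame showing that behaviour and the `errors="raise"` option that reports such keys instead:

```python
import pandas as pd

demo = pd.DataFrame({"country": ["TW"], "city": ["A2A"]})

# 'name' is not a column, so it is skipped without any warning
renamed = demo.rename(columns={"country": "C_Code", "name": "C_Name"})
print(list(renamed.columns))

# errors="raise" turns the silent skip into a KeyError
try:
    demo.rename(columns={"name": "C_Name"}, errors="raise")
    raised = False
except KeyError:
    raised = True

print(raised)
```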

In [27]:
renamedDF.head()

Unnamed: 0,C_Code,city,count,locations,pollution_per100
0,TW,,4863951,66,48639.51
1,JP,,24985118,1530,249851.18
2,CN,,142407,1,1424.07
3,US,007,20445,4,204.45
4,US,015,1025,1,10.25


## Combine data from multiple tables
- `pd.concat()` concatenates multiple tables along one of the axes (row-wise or column-wise)
- row-wise concatenation (`axis=0`) is the most common case
- `concat` is a general function provided in the pandas module
    - https://pandas.pydata.org/pandas-docs/stable/reference/general_functions.html
    - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

![](./images/08_concat_row1.svg)

In [28]:
# make a deep copy of dataframe/table
renamedDF1 = renamedDF.copy(deep=True)

In [29]:
renamedDF1.shape

(1000, 5)

In [30]:
# let's concatenate the two into a single table
combinedDF = pd.concat([renamedDF, renamedDF1], axis=0)

In [31]:
combinedDF.shape

(2000, 5)
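Row-wise concatenation keeps each table's original row labels, so the combined table can contain repeated index values. A minimal sketch with two tiny made-up frames, showing the repeated labels and the `ignore_index=True` option that renumbers the rows:

```python
import pandas as pd

top = pd.DataFrame({"city": ["A", "B"], "count": [1, 2]})
bottom = top.copy(deep=True)

# original labels are kept, so 0 and 1 each appear twice
stacked = pd.concat([top, bottom], axis=0)
print(stacked.index.tolist())

# ignore_index=True discards the old labels and builds a fresh 0..n-1 index
renumbered = pd.concat([top, bottom], axis=0, ignore_index=True)
print(renumbered.index.tolist())
```

With repeated labels, `stacked.loc[0]` returns two rows rather than one, which is worth keeping in mind before selecting by label.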

## Join tables using a common identifier
- a merge combines tables column-wise by matching rows on a common key
- the figure below shows a left join

![](./images/08_merge_left.svg)

- can use the general function `pd.merge()` provided in the pandas module
    - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html
- the DataFrame class also provides a `merge` method
- `merge` takes a `how` parameter to select the type of join
    - one of 'left', 'right', 'outer', 'inner', 'cross'; the default is 'inner'
    - `left`: use only keys from left frame, similar to a SQL left outer join; preserve key order.
    - `right`: use only keys from right frame, similar to a SQL right outer join; preserve key order.
    - `outer`: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
    - `inner`: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
    - `cross`: creates the cartesian product from both frames, preserves the order of the left keys.
- `pandas.concat([df1, df2], axis=0)` is equivalent to `UNION ALL` in SQL (duplicate rows are kept; SQL `UNION` would drop them)
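`pd.concat` stacks rows without removing duplicates, which corresponds to SQL `UNION ALL`. A sketch of how both SQL set operations might be reproduced, using two small made-up tables (`left` and `right` are illustrative names):

```python
import pandas as pd

left = pd.DataFrame({"k": ["foo", "bar"], "v": [1, 2]})
right = pd.DataFrame({"k": ["bar", "baz"], "v": [2, 3]})

# UNION ALL: every row from both tables, duplicates included
union_all = pd.concat([left, right], axis=0, ignore_index=True)

# UNION: the same rows with exact duplicates removed
union = union_all.drop_duplicates(ignore_index=True)

print(len(union_all), len(union))
```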

![](./images/sqlJoins_7.webp)

In [132]:
# create a DF with key column lkey
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})

In [133]:
df1

Unnamed: 0,lkey,value
0,foo,1
1,bar,2
2,baz,3
3,foo,5


In [134]:
# create a DF with key column rkey
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})

In [135]:
df2

Unnamed: 0,rkey,value
0,foo,5
1,bar,6
2,baz,7
3,foo,8


In [136]:
# inner join (the default `how`) on lkey == rkey; repeated keys produce every matching pair
df1.merge(df2, left_on='lkey', right_on='rkey')

Unnamed: 0,lkey,value_x,rkey,value_y
0,foo,1,foo,5
1,foo,1,foo,8
2,foo,5,foo,5
3,foo,5,foo,8
4,bar,2,bar,6
5,baz,3,baz,7
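Both inputs have a `value` column, so pandas appends `_x` and `_y` to tell them apart. The `suffixes` parameter lets you pick clearer names; a small sketch that rebuilds the same two frames under the illustrative names `df_left` and `df_right`:

```python
import pandas as pd

df_left = pd.DataFrame({"lkey": ["foo", "bar", "baz", "foo"],
                        "value": [1, 2, 3, 5]})
df_right = pd.DataFrame({"rkey": ["foo", "bar", "baz", "foo"],
                         "value": [5, 6, 7, 8]})

# suffixes replaces the default ('_x', '_y') on overlapping column names
merged = df_left.merge(df_right, left_on="lkey", right_on="rkey",
                       suffixes=("_left", "_right"))

print(list(merged.columns))
print(len(merged))
```

The key `foo` appears twice on each side, so it contributes 2 × 2 = 4 rows; with one row each for `bar` and `baz`, the result has 6 rows.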


In [137]:
# join on common column names
df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})

In [138]:
df1

Unnamed: 0,a,b
0,foo,1
1,bar,2


In [139]:
df2

Unnamed: 0,a,c
0,foo,3
1,baz,4


In [140]:
# intersection or inner join
df1.merge(df2, how='inner', on='a')

Unnamed: 0,a,b,c
0,foo,1,3


In [141]:
# left join
df1.merge(df2, how='left', on='a')

Unnamed: 0,a,b,c
0,foo,1,3.0
1,bar,2,


In [142]:
# right join
df1.merge(df2, how='right', on='a')

Unnamed: 0,a,b,c
0,foo,1.0,3
1,baz,,4


In [143]:
# outer join
df1.merge(df2, how='outer', on='a')

Unnamed: 0,a,b,c
0,foo,1.0,3.0
1,bar,2.0,
2,baz,,4.0
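To see which side each row of a join came from, `merge` accepts `indicator=True`, which adds a `_merge` column. A minimal sketch that rebuilds the two small frames under the illustrative names `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({"a": ["foo", "bar"], "b": [1, 2]})
right = pd.DataFrame({"a": ["foo", "baz"], "c": [3, 4]})

# _merge is 'both', 'left_only' or 'right_only' for each row
outer = left.merge(right, how="outer", on="a", indicator=True)
print(outer)

# map each key to its origin; row order does not matter for this lookup
origin = dict(zip(outer["a"], outer["_merge"].astype(str)))
print(origin)
```

The integer columns `b` and `c` come back as floats because the unmatched rows are filled with `NaN`, which the default integer dtype cannot hold.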


In [144]:
no2_url = 'https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_no2_long.csv'
pm2_url = 'https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_pm25_long.csv'
air_quality_stations_url = 'https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_stations.csv'
air_qual_parameters_url = 'https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_parameters.csv'

In [145]:
air_quality_no2 = pd.read_csv(no2_url)

In [146]:
air_quality_no2.head()

Unnamed: 0,city,country,date.utc,location,parameter,value,unit
0,Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³
1,Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³
2,Paris,FR,2019-06-20 22:00:00+00:00,FR04014,no2,26.5,µg/m³
3,Paris,FR,2019-06-20 21:00:00+00:00,FR04014,no2,24.9,µg/m³
4,Paris,FR,2019-06-20 20:00:00+00:00,FR04014,no2,21.4,µg/m³


In [147]:
air_quality_parameters = pd.read_csv(air_qual_parameters_url)

In [148]:
air_quality_parameters.head()

Unnamed: 0,id,description,name
0,bc,Black Carbon,BC
1,co,Carbon Monoxide,CO
2,no2,Nitrogen Dioxide,NO2
3,o3,Ozone,O3
4,pm10,Particulate matter less than 10 micrometers in...,PM10


In [149]:
# `parameter` in air_quality_no2 and `id` in air_quality_parameters hold the same codes, so join on them
air_quality = pd.merge(air_quality_no2, air_quality_parameters, how='left', left_on='parameter', right_on='id')

In [150]:
air_quality.head(10)

Unnamed: 0,city,country,date.utc,location,parameter,value,unit,id,description,name
0,Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³,no2,Nitrogen Dioxide,NO2
1,Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³,no2,Nitrogen Dioxide,NO2
2,Paris,FR,2019-06-20 22:00:00+00:00,FR04014,no2,26.5,µg/m³,no2,Nitrogen Dioxide,NO2
3,Paris,FR,2019-06-20 21:00:00+00:00,FR04014,no2,24.9,µg/m³,no2,Nitrogen Dioxide,NO2
4,Paris,FR,2019-06-20 20:00:00+00:00,FR04014,no2,21.4,µg/m³,no2,Nitrogen Dioxide,NO2
5,Paris,FR,2019-06-20 19:00:00+00:00,FR04014,no2,25.3,µg/m³,no2,Nitrogen Dioxide,NO2
6,Paris,FR,2019-06-20 18:00:00+00:00,FR04014,no2,23.9,µg/m³,no2,Nitrogen Dioxide,NO2
7,Paris,FR,2019-06-20 17:00:00+00:00,FR04014,no2,23.2,µg/m³,no2,Nitrogen Dioxide,NO2
8,Paris,FR,2019-06-20 16:00:00+00:00,FR04014,no2,19.0,µg/m³,no2,Nitrogen Dioxide,NO2
9,Paris,FR,2019-06-20 15:00:00+00:00,FR04014,no2,19.3,µg/m³,no2,Nitrogen Dioxide,NO2


In [151]:
air_quality.tail(10)

Unnamed: 0,city,country,date.utc,location,parameter,value,unit,id,description,name
2058,London,GB,2019-05-07 11:00:00+00:00,London Westminster,no2,21.0,µg/m³,no2,Nitrogen Dioxide,NO2
2059,London,GB,2019-05-07 10:00:00+00:00,London Westminster,no2,21.0,µg/m³,no2,Nitrogen Dioxide,NO2
2060,London,GB,2019-05-07 09:00:00+00:00,London Westminster,no2,28.0,µg/m³,no2,Nitrogen Dioxide,NO2
2061,London,GB,2019-05-07 08:00:00+00:00,London Westminster,no2,32.0,µg/m³,no2,Nitrogen Dioxide,NO2
2062,London,GB,2019-05-07 07:00:00+00:00,London Westminster,no2,32.0,µg/m³,no2,Nitrogen Dioxide,NO2
2063,London,GB,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0,µg/m³,no2,Nitrogen Dioxide,NO2
2064,London,GB,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0,µg/m³,no2,Nitrogen Dioxide,NO2
2065,London,GB,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0,µg/m³,no2,Nitrogen Dioxide,NO2
2066,London,GB,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0,µg/m³,no2,Nitrogen Dioxide,NO2
2067,London,GB,2019-05-07 01:00:00+00:00,London Westminster,no2,23.0,µg/m³,no2,Nitrogen Dioxide,NO2
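After a join on differently named keys, both key columns (`parameter` and `id`) are kept even though they carry the same values. A sketch on small stand-in tables (the rows are abbreviated from the outputs above; `measurements` and `parameters` are illustrative names) that drops the redundant column and uses `validate` to check that the lookup table has one row per key:

```python
import pandas as pd

measurements = pd.DataFrame({
    "location": ["FR04014", "FR04014", "London Westminster"],
    "parameter": ["no2", "no2", "no2"],
    "value": [20.0, 21.8, 23.0],
})
parameters = pd.DataFrame({
    "id": ["no2", "o3"],
    "description": ["Nitrogen Dioxide", "Ozone"],
    "name": ["NO2", "O3"],
})

# validate="many_to_one" raises MergeError if `id` is not unique on the right
merged = pd.merge(
    measurements, parameters,
    how="left", left_on="parameter", right_on="id",
    validate="many_to_one",
).drop(columns="id")  # `id` duplicates `parameter` after the join

print(merged)
```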


## Working with textual data
- the `.str` accessor applies string methods (mirroring Python's `str` methods) to every element of a text column
- let's work with the Titanic dataset
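As a quick illustration of the `.str` accessor before loading the real data, here is a minimal sketch on a hand-made Series (the `names` values are illustrative); note that missing values stay missing instead of raising an error:

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris", None, "Allen, Mr. William Henry"])

# each .str method is applied element by element
lowered = names.str.lower()
lengths = names.str.len()

print(lowered)
print(lengths)
```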

In [152]:
import pandas as pd

In [153]:
titanic = pd.read_csv('data/titanic.csv', index_col="PassengerId")

In [154]:
titanic.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [155]:
# convert Names to lowercase
titanic["Name"] = titanic["Name"].str.lower()

In [156]:
titanic.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"braund, mr. owen harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"cumings, mrs. john bradley (florence briggs th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"heikkinen, miss. laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"futrelle, mrs. jacques heath (lily may peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"allen, mr. william henry",male,35.0,0,0,373450,8.05,,S


In [157]:
# To build a "Surname" column, first split each Name at the comma; the surname is the part before it
titanic["Name"].str.split(",")

PassengerId
1                             [braund,  mr. owen harris]
2      [cumings,  mrs. john bradley (florence briggs ...
3                              [heikkinen,  miss. laina]
4        [futrelle,  mrs. jacques heath (lily may peel)]
5                            [allen,  mr. william henry]
                             ...                        
887                             [montvila,  rev. juozas]
888                      [graham,  miss. margaret edith]
889          [johnston,  miss. catherine helen "carrie"]
890                             [behr,  mr. karl howell]
891                               [dooley,  mr. patrick]
Name: Name, Length: 891, dtype: object

In [158]:
titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)

In [159]:
titanic.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname
1,0,3,"braund, mr. owen harris",male,22.0,1,0,A/5 21171,7.25,,S,braund
2,1,1,"cumings, mrs. john bradley (florence briggs th...",female,38.0,1,0,PC 17599,71.2833,C85,C,cumings
3,1,3,"heikkinen, miss. laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,heikkinen
4,1,1,"futrelle, mrs. jacques heath (lily may peel)",female,35.0,1,0,113803,53.1,C123,S,futrelle
5,0,3,"allen, mr. william henry",male,35.0,0,0,373450,8.05,,S,allen


In [160]:
# select the rows for passengers whose Name contains "henry"
titanic[titanic["Name"].str.contains("henry")]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname
5,0,3,"allen, mr. william henry",male,35.0,0,0,373450,8.05,,S,allen
13,0,3,"saundercock, mr. william henry",male,20.0,0,0,A/5. 2151,8.05,,S,saundercock
53,1,1,"harper, mrs. henry sleeper (myna haxtun)",female,49.0,1,0,PC 17572,76.7292,D33,C,harper
63,0,1,"harris, mr. henry birkhardt",male,45.0,1,0,36973,83.475,C83,S,harris
160,0,3,"sage, master. thomas henry",male,,8,2,CA. 2343,69.55,,S,sage
177,0,3,"lefebre, master. henry forbes",male,,3,1,4133,25.4667,,S,lefebre
210,1,1,"blank, mr. henry",male,40.0,0,0,112277,31.0,A31,C,blank
213,0,3,"perkin, mr. john henry",male,22.0,0,0,A/5 21174,7.25,,S,perkin
223,0,3,"green, mr. george henry",male,51.0,0,0,21440,8.05,,S,green
228,0,3,"lovell, mr. john hall (""henry"")",male,20.5,0,0,A/5 21173,7.25,,S,lovell
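The match above works because the names were lowercased first. A sketch of how the same filter could be written without that step, on a small made-up Series: `case=False` ignores case, `na=False` treats missing names as non-matches, and a regular expression with `\b` restricts the match to the whole word:

```python
import pandas as pd

names = pd.Series([
    "Allen, Mr. William Henry",
    "Henryson, Mr. Ole",        # made-up name containing "henry" as a prefix
    None,
    "Braund, Mr. Owen Harris",
])

# substring match, case-insensitive; missing values count as False
mask = names.str.contains("henry", case=False, na=False)

# \b marks a word boundary, so "Henryson" no longer matches
whole_word = names.str.contains(r"\bhenry\b", case=False, na=False)

print(names[mask].tolist())
print(names[whole_word].tolist())
```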


In [161]:
# select Name of the passenger with the longest Name
titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]
# idxmax() gets the index label for which the length is the largest

'penasco y castellana, mrs. victor de satode (maria josefa perez de soto y vallejo)'
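The lookup combines two steps: `str.len()` computes a length per row, and `idxmax()` returns the index label (not the position) of the largest one, which `.loc` then uses. A minimal sketch with a made-up Series whose labels do not start at 0, to make the difference visible:

```python
import pandas as pd

names = pd.Series(["ann", "bartholomew", "cy"], index=[101, 102, 103])

lengths = names.str.len()            # length of each name
longest_label = lengths.idxmax()     # index label of the maximum length
longest_name = names.loc[longest_label]

print(longest_label, longest_name)
```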

In [162]:
# replace values of "male" by "M" and values of "female" by "F" and add it as a new column
# replace method requires a dictionary to define the mapping {from: to}
titanic["Gender"] = titanic["Sex"].replace({"male": "M", "female": "F"})

In [163]:
titanic.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname,Gender
1,0,3,"braund, mr. owen harris",male,22.0,1,0,A/5 21171,7.25,,S,braund,M
2,1,1,"cumings, mrs. john bradley (florence briggs th...",female,38.0,1,0,PC 17599,71.2833,C85,C,cumings,F
3,1,3,"heikkinen, miss. laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,heikkinen,F
4,1,1,"futrelle, mrs. jacques heath (lily may peel)",female,35.0,1,0,113803,53.1,C123,S,futrelle,F
5,0,3,"allen, mr. william henry",male,35.0,0,0,373450,8.05,,S,allen,M
