# Intro to Pandas
- [https://pandas.pydata.org/](https://pandas.pydata.org/)
- a fast, powerful, flexible and easy to use open source data analysis and manipulation tool built on Python

## What kind of data does pandas handle?

### pandas data table representation
![img](images/Pandas-Table.svg)
- to work with the pandas package, start by importing it
- to install or update pandas, use either conda or pip

```bash
conda install pandas
pip install pandas
```

In [4]:
import pandas as pd

In [5]:
pd.__version__

'1.1.1'

## DataFrame
- a data table in pandas is called a `DataFrame`
- a Python dict can be used to create a DataFrame: the keys become the column headers and the lists of values become the columns
- each column of a DataFrame is a `Series`

In [7]:
aDict = {
    "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth"
    ],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"]
}

In [8]:
df = pd.DataFrame(aDict)

In [10]:
print(df)

                       Name  Age     Sex
0   Braund, Mr. Owen Harris   22    male
1  Allen, Mr. William Henry   35    male
2  Bonnell, Miss. Elizabeth   58  female


### spreadsheet data
- the above `df` can also be represented in spreadsheet software
![SpreadSheet](./images/01_table_spreadsheet.png)

In [11]:
# just work with the data in column Age
df["Age"]

0    22
1    35
2    58
Name: Age, dtype: int64

In [13]:
# do something with dataframe
print(df.describe())

             Age
count   3.000000
mean   38.333333
std    18.230012
min    22.000000
25%    28.500000
50%    35.000000
75%    46.500000
max    58.000000


In [14]:
print(df['Age'].max())

58


In [17]:
# print first 2 rows
print(df.head(2))

                       Name  Age   Sex
0   Braund, Mr. Owen Harris   22  male
1  Allen, Mr. William Henry   35  male


In [18]:
# print last 2 rows
print(df.tail(2))

                       Name  Age     Sex
1  Allen, Mr. William Henry   35    male
2  Bonnell, Miss. Elizabeth   58  female


In [19]:
df.dtypes

Name    object
Age      int64
Sex     object
dtype: object

## Read and write tabular data

![](./images/02_io_readwrite1.svg)

### Titanic Data
- https://github.com/pandas-dev/pandas/blob/master/doc/data/titanic.csv

```
PassengerId: ID of each passenger.

Survived: 0 = did not survive, 1 = survived.

Pclass: Ticket class: 1 (first), 2 (second) or 3 (third).

Name: Name of the passenger.

Sex: Sex of the passenger.

Age: Age of the passenger in years.

SibSp: Number of siblings and spouses aboard.

Parch: Number of parents and children aboard.

Ticket: Ticket number.

Fare: Fare paid by the passenger.

Cabin: Cabin number.

Embarked: Port of embarkation (C, Q or S).
```
- use pandas `.read_*(fileName)` to read data from various formats
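
The `read_*` functions share one pattern: pass a path, a URL, or a file-like object and get a DataFrame back. A minimal sketch with `pd.read_csv` on an in-memory string, so it runs without any file on disk (the three-row CSV text and the name `sample_df` are made up for illustration):

```python
import io

import pandas as pd

# hypothetical CSV content standing in for a file on disk
csv_text = """PassengerId,Name,Age
1,Alice,22
2,Bob,
3,Carol,58
"""

# read_csv accepts a path, a URL, or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # Age is read as float64 because one value is missing
```

The empty Age field is read as `NaN`, which is why the column ends up as a float column rather than an integer one.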

In [21]:
# let's read titanic.csv file
titanicDf = pd.read_csv('data/titanic.csv')

In [23]:
print(titanicDf)

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                                 ...     ...   ... 

In [24]:
# print first 8 rows
print(titanicDf.head(8))

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   
5            6         0       3   
6            7         0       1   
7            8         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   
5                                   Moran, Mr. James    male   NaN      0   
6                            McCarthy, Mr. Timothy J    male  54.0      0   
7                     Palsson, Master. Gosta Leonard    mal

In [25]:
titanicDf.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [27]:
# write the DataFrame as Excel spreadsheet
# need openpyxl library
titanicDf.to_excel('data/titanic.xlsx', sheet_name='passengers')
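
The `to_*` writers mirror the `read_*` readers. As a rough sketch that avoids the extra openpyxl dependency, the same round trip can be tried with CSV and an in-memory buffer (the tiny table and the names `small_df`, `buffer` and `round_trip` are illustrative assumptions):

```python
import io

import pandas as pd

small_df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [22, 35]})

# write to an in-memory text buffer instead of a file;
# index=False leaves the row labels out of the output
buffer = io.StringIO()
small_df.to_csv(buffer, index=False)

# rewind the buffer and read the data back
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip)
```

Without `index=False`, the row labels are written as an extra unnamed first column, which then shows up as a column when the file is read back.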

In [28]:
# technical summary of DataFrame
titanicDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## select a subset of a DataFrame

![](./images/03_subset_columns.svg)

In [29]:
# select just the Age column (a Series)
ages = titanicDf['Age']

In [30]:
type(ages)

pandas.core.series.Series

In [31]:
ages.shape

(891,)

In [32]:
# get age and sex columns
age_sex = titanicDf[['Age', 'Sex']]

In [34]:
print(age_sex.head())

    Age     Sex
0  22.0    male
1  38.0  female
2  26.0  female
3  35.0  female
4  35.0    male


In [35]:
age_sex.shape

(891, 2)
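
The shapes above differ because single brackets return a `Series` while a list of column names returns a `DataFrame`. A small sketch on made-up data (the `toy` table is an assumption for illustration, not the Titanic file):

```python
import pandas as pd

toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Sex": ["male", "female", "female"]})

as_series = toy["Age"]    # single brackets: one column as a Series
as_frame = toy[["Age"]]   # list of names: a DataFrame, even with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```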

In [36]:
# passengers older than 35
titanicDf['Age'] > 35

0      False
1       True
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool

In [37]:
# create a new DataFrame from the rows where the condition is True
passengers = titanicDf[titanicDf['Age']>35]
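
Boolean masks like the one above can be combined. A minimal sketch on a made-up table (the `toy` data and the name `older_women` are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [22, 38, 58, 40],
    "Sex": ["male", "female", "female", "male"],
})

# wrap each condition in parentheses; combine with & (and), | (or), ~ (not)
older_women = toy[(toy["Age"] > 35) & (toy["Sex"] == "female")]
print(older_women)
```

The Python keywords `and` / `or` do not work here because they try to reduce a whole Series to a single truth value.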

In [40]:
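# describe without parentheses prints the bound method's repr; call passengers.describe() for statistics
print(passengers.describe)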

<bound method NDFrame.describe of      PassengerId  Survived  Pclass  \
1              2         1       1   
6              7         0       1   
11            12         1       1   
13            14         0       3   
15            16         1       2   
..           ...       ...     ...   
865          866         1       2   
871          872         1       1   
873          874         0       3   
879          880         1       1   
885          886         0       3   

                                                  Name     Sex   Age  SibSp  \
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
6                              McCarthy, Mr. Timothy J    male  54.0      0   
11                            Bonnell, Miss. Elizabeth  female  58.0      0   
13                         Andersson, Mr. Anders Johan    male  39.0      1   
15                    Hewlett, Mrs. (Mary D Kingcome)   female  55.0      0   
..                                 

In [43]:
# another example of selection
class_23 = titanicDf[titanicDf["Pclass"].isin([2, 3])]

In [45]:
print(class_23.head())

   PassengerId  Survived  Pclass                            Name     Sex  \
0            1         0       3         Braund, Mr. Owen Harris    male   
2            3         1       3          Heikkinen, Miss. Laina  female   
4            5         0       3        Allen, Mr. William Henry    male   
5            6         0       3                Moran, Mr. James    male   
7            8         0       3  Palsson, Master. Gosta Leonard    male   

    Age  SibSp  Parch            Ticket     Fare Cabin Embarked  
0  22.0      1      0         A/5 21171   7.2500   NaN        S  
2  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S  
4  35.0      0      0            373450   8.0500   NaN        S  
5   NaN      0      0            330877   8.4583   NaN        Q  
7   2.0      3      1            349909  21.0750   NaN        S  


In [47]:
# select data where age is known
age_no_na = titanicDf[titanicDf['Age'].notna()]

In [49]:
print(age_no_na.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In [50]:
age_no_na.shape

(714, 12)
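
Missing values interact with selection in a way worth seeing on a small case. The sketch below uses a made-up three-row table (all names are illustrative) to show that comparisons against `NaN` are False, and that `dropna(subset=...)` is an alternative spelling of the `notna()` filter:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [22.0, np.nan, 58.0]})

# a comparison with NaN is False, so the row with unknown age drops out
older = toy[toy["Age"] > 35]

# keep only the rows where Age is known, in two equivalent ways
known = toy[toy["Age"].notna()]
known_alt = toy.dropna(subset=["Age"])

print(len(older), len(known), known.equals(known_alt))
```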

In [53]:
# select the names of passengers older than 35
adult_names = titanicDf.loc[titanicDf['Age'] > 35, 'Name']

In [54]:
adult_names.head()

1     Cumings, Mrs. John Bradley (Florence Briggs Th...
6                               McCarthy, Mr. Timothy J
11                             Bonnell, Miss. Elizabeth
13                          Andersson, Mr. Anders Johan
15                     Hewlett, Mrs. (Mary D Kingcome) 
Name: Name, dtype: object

In [56]:
# select rows 10-25 and columns 3-5 (counting from 1); iloc positions are 0-based and exclude the end
print(titanicDf.iloc[9:25, 2:5])

    Pclass                                               Name     Sex
9        2                Nasser, Mrs. Nicholas (Adele Achem)  female
10       3                    Sandstrom, Miss. Marguerite Rut  female
11       1                           Bonnell, Miss. Elizabeth  female
12       3                     Saundercock, Mr. William Henry    male
13       3                        Andersson, Mr. Anders Johan    male
14       3               Vestrom, Miss. Hulda Amanda Adolfina  female
15       2                   Hewlett, Mrs. (Mary D Kingcome)   female
16       3                               Rice, Master. Eugene    male
17       2                       Williams, Mr. Charles Eugene    male
18       3  Vander Planke, Mrs. Julius (Emelia Maria Vande...  female
19       3                            Masselmani, Mrs. Fatima  female
20       2                               Fynney, Mr. Joseph J    male
21       2                              Beesley, Mr. Lawrence    male
22       3          

## updating selected elements with iloc
- update first 3 names to "anonymous"

In [57]:
titanicDf.iloc[0:3, 3] = "anonymous"

In [58]:
titanicDf.head()['Name']

0                                       anonymous
1                                       anonymous
2                                       anonymous
3    Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                        Allen, Mr. William Henry
Name: Name, dtype: object
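
The same kind of update can be written with `loc`, which addresses columns by name instead of by position and also accepts a boolean condition. A sketch on a made-up table (the `toy` data and the replacement values are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dave"],
    "Age": [22, 38, 58, 40],
})

# label-based: rows 0 through 2 (loc slices include the end label), column by name
toy.loc[0:2, "Name"] = "anonymous"

# condition-based: overwrite Name wherever Age is above 39
toy.loc[toy["Age"] > 39, "Name"] = "senior"

print(toy)
```

Note the slicing difference: `iloc[0:3]` and `loc[0:2]` both touch the first three rows here, because `iloc` excludes the end position while `loc` includes the end label.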

## creating new columns derived from existing columns

![](./images/05_newcolumn_1.svg)

## OpenAQ API
- http://dhhagan.github.io/py-openaq/tutorial/api.html#openaq-api
- https://py-openaq.readthedocs.io/en/latest/

```bash
pip install py-openaq
```

### Let's use air quality data provided by OpenAQ API

In [60]:
import openaq

In [62]:
api = openaq.OpenAQ()

In [63]:
pollutionDF = api.cities(df=True, limit=1000)



In [64]:
pollutionDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   country    1000 non-null   object
 1   name       1000 non-null   object
 2   city       1000 non-null   object
 3   count      1000 non-null   int64 
 4   locations  1000 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 39.2+ KB


In [65]:
print(pollutionDF.head())

  country                name                city   count  locations
0      AD  Escaldes-Engordany  Escaldes-Engordany  204662          2
1      AD              unused              unused   16301          1
2      AE           Abu Dhabi           Abu Dhabi  436140          1
3      AE               Dubai               Dubai  433111          1
4      AE                 N/A                 N/A   48205          2


In [68]:
pollutionDF['pollution_per100'] = pollutionDF['count']/100

In [71]:
print(pollutionDF.head())

  country                name                city   count  locations  \
0      AD  Escaldes-Engordany  Escaldes-Engordany  204662          2   
1      AD              unused              unused   16301          1   
2      AE           Abu Dhabi           Abu Dhabi  436140          1   
3      AE               Dubai               Dubai  433111          1   
4      AE                 N/A                 N/A   48205          2   

   pollution_per100  
0           2046.62  
1            163.01  
2           4361.40  
3           4331.11  
4            482.05  
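
A new column can also be computed from several existing columns at once. A minimal sketch on a made-up table (the `cities` data and the column names `count_per_location` and `count_thousands` are hypothetical, not part of the OpenAQ response):

```python
import pandas as pd

cities = pd.DataFrame({
    "city": ["X", "Y"],
    "count": [1000, 300],
    "locations": [4, 3],
})

# arithmetic between columns is applied row by row
cities["count_per_location"] = cities["count"] / cities["locations"]

# assign() returns a new DataFrame and leaves the original untouched
with_thousands = cities.assign(count_thousands=cities["count"] / 1000)

print(with_thousands)
```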


In [76]:
# rename column headers
renamedDF = pollutionDF.rename(
    columns={
        'country': 'C_Code',
        'name' : 'C_Name',
    }
)

In [75]:
# see all the column names
pollutionDF.columns

Index(['country', 'name', 'city', 'count', 'locations', 'pollution_per100'], dtype='object')

In [77]:
renamedDF.columns

Index(['C_Code', 'C_Name', 'city', 'count', 'locations', 'pollution_per100'], dtype='object')
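
As the two column listings show, `rename` returns a new DataFrame and leaves the original unchanged. Besides a dict, it also accepts a function that is applied to every label. A small sketch on made-up data (the `cities` table is an assumption for illustration):

```python
import pandas as pd

cities = pd.DataFrame({"country": ["AD"], "name": ["unused"], "count": [16301]})

# a dict renames only the listed columns
renamed = cities.rename(columns={"country": "C_Code"})

# a function is applied to every column label
upper = cities.rename(columns=str.upper)

print(list(cities.columns))
print(list(renamed.columns))
print(list(upper.columns))
```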

In [79]:
print(renamedDF.head())

  C_Code              C_Name                city   count  locations  \
0     AD  Escaldes-Engordany  Escaldes-Engordany  204662          2   
1     AD              unused              unused   16301          1   
2     AE           Abu Dhabi           Abu Dhabi  436140          1   
3     AE               Dubai               Dubai  433111          1   
4     AE                 N/A                 N/A   48205          2   

   pollution_per100  
0           2046.62  
1            163.01  
2           4361.40  
3           4331.11  
4            482.05  


## Combine data from multiple tables
- `pd.concat()` concatenates multiple tables along one axis (row-wise or column-wise)
- row-wise concatenation is the most common case

![](./images/08_concat_row1.svg)

In [80]:
# make a deep copy of dataframe/table
renamedDF1 = renamedDF.copy(deep=True)

In [82]:
print(renamedDF1.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   C_Code            1000 non-null   object 
 1   C_Name            1000 non-null   object 
 2   city              1000 non-null   object 
 3   count             1000 non-null   int64  
 4   locations         1000 non-null   int64  
 5   pollution_per100  1000 non-null   float64
dtypes: float64(1), int64(2), object(3)
memory usage: 47.0+ KB
None


In [83]:
# let's concatenate the two into a single table
combinedDF = pd.concat([renamedDF, renamedDF1], axis=0)

In [84]:
combinedDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   C_Code            2000 non-null   object 
 1   C_Name            2000 non-null   object 
 2   city              2000 non-null   object 
 3   count             2000 non-null   int64  
 4   locations         2000 non-null   int64  
 5   pollution_per100  2000 non-null   float64
dtypes: float64(1), int64(2), object(3)
memory usage: 109.4+ KB
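
The index line above (2000 entries, labels 0 to 999) shows that each table kept its own row labels, so every label now appears twice. A sketch of how `ignore_index=True` renumbers the rows, using two made-up tables (`first` and `second` are illustrative):

```python
import pandas as pd

first = pd.DataFrame({"city": ["Paris", "London"], "value": [20.0, 21.0]})
second = pd.DataFrame({"city": ["Antwerp"], "value": [18.5]})

# default: each table keeps its own row labels, so labels can repeat
stacked = pd.concat([first, second], axis=0)

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([first, second], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```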


## join tables using a common identifier
- merge tables column-wise; a left-join

![](./images/08_merge_left.svg)

In [104]:
no2_url = 'https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_no2_long.csv'
pm2_url = 'https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_pm25_long.csv'
air_quality_stations_url = 'https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_stations.csv'
air_qual_parameters_url = 'https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_parameters.csv'

In [86]:
air_quality_no2 = pd.read_csv(no2_url)

In [88]:
print(air_quality_no2.head())

    city country                   date.utc location parameter  value   unit
0  Paris      FR  2019-06-21 00:00:00+00:00  FR04014       no2   20.0  µg/m³
1  Paris      FR  2019-06-20 23:00:00+00:00  FR04014       no2   21.8  µg/m³
2  Paris      FR  2019-06-20 22:00:00+00:00  FR04014       no2   26.5  µg/m³
3  Paris      FR  2019-06-20 21:00:00+00:00  FR04014       no2   24.9  µg/m³
4  Paris      FR  2019-06-20 20:00:00+00:00  FR04014       no2   21.4  µg/m³


In [94]:
air_quality_parameters = pd.read_csv(air_qual_parameters_url)

In [95]:
print(air_quality_parameters.head())

     id                                        description  name
0    bc                                       Black Carbon    BC
1    co                                    Carbon Monoxide    CO
2   no2                                   Nitrogen Dioxide   NO2
3    o3                                              Ozone    O3
4  pm10  Particulate matter less than 10 micrometers in...  PM10


In [98]:
# the `parameter` column of the no2 table and the `id` column of the parameters table hold the same keys
air_quality = pd.merge(air_quality_no2, air_quality_parameters, how='left', left_on='parameter', right_on='id')
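
To see what a left join does with keys that have no match, here is a small sketch on made-up tables (the `measurements` and `parameters` data, including the unmatched key `xyz`, are assumptions for illustration). The `indicator=True` option adds a `_merge` column recording where each row came from:

```python
import pandas as pd

measurements = pd.DataFrame({
    "parameter": ["no2", "no2", "xyz"],
    "value": [20.0, 21.8, 5.0],
})
parameters = pd.DataFrame({
    "id": ["no2", "pm25"],
    "name": ["NO2", "PM2.5"],
})

# left join: every row of `measurements` is kept;
# rows without a match in `parameters` get NaN in the added columns
merged = pd.merge(
    measurements, parameters,
    how="left", left_on="parameter", right_on="id",
    indicator=True,
)
print(merged)
```

Rows of `parameters` that match nothing on the left (here `pm25`) do not appear in the result.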

In [103]:
print(air_quality.head(10))

    city country                   date.utc location parameter  value   unit  \
0  Paris      FR  2019-06-21 00:00:00+00:00  FR04014       no2   20.0  µg/m³   
1  Paris      FR  2019-06-20 23:00:00+00:00  FR04014       no2   21.8  µg/m³   
2  Paris      FR  2019-06-20 22:00:00+00:00  FR04014       no2   26.5  µg/m³   
3  Paris      FR  2019-06-20 21:00:00+00:00  FR04014       no2   24.9  µg/m³   
4  Paris      FR  2019-06-20 20:00:00+00:00  FR04014       no2   21.4  µg/m³   
5  Paris      FR  2019-06-20 19:00:00+00:00  FR04014       no2   25.3  µg/m³   
6  Paris      FR  2019-06-20 18:00:00+00:00  FR04014       no2   23.9  µg/m³   
7  Paris      FR  2019-06-20 17:00:00+00:00  FR04014       no2   23.2  µg/m³   
8  Paris      FR  2019-06-20 16:00:00+00:00  FR04014       no2   19.0  µg/m³   
9  Paris      FR  2019-06-20 15:00:00+00:00  FR04014       no2   19.3  µg/m³   

    id       description name  
0  no2  Nitrogen Dioxide  NO2  
1  no2  Nitrogen Dioxide  NO2  
2  no2  Nitrogen Dioxid

In [102]:
print(air_quality.tail(10))

        city country                   date.utc            location parameter  \
2058  London      GB  2019-05-07 11:00:00+00:00  London Westminster       no2   
2059  London      GB  2019-05-07 10:00:00+00:00  London Westminster       no2   
2060  London      GB  2019-05-07 09:00:00+00:00  London Westminster       no2   
2061  London      GB  2019-05-07 08:00:00+00:00  London Westminster       no2   
2062  London      GB  2019-05-07 07:00:00+00:00  London Westminster       no2   
2063  London      GB  2019-05-07 06:00:00+00:00  London Westminster       no2   
2064  London      GB  2019-05-07 04:00:00+00:00  London Westminster       no2   
2065  London      GB  2019-05-07 03:00:00+00:00  London Westminster       no2   
2066  London      GB  2019-05-07 02:00:00+00:00  London Westminster       no2   
2067  London      GB  2019-05-07 01:00:00+00:00  London Westminster       no2   

      value   unit   id       description name  
2058   21.0  µg/m³  no2  Nitrogen Dioxide  NO2  
2059   21.