<a id='back_to_top'></a>

<img src='img/_logo.JPG' alt='Drawing' style='width:2000px;'/>

# <font color=blue>3. Libraries</font>
## <font color=blue>3.4. pandas</font>
| | |
|-|-|
| | |
| <img src='https://pandas.pydata.org/static/img/pandas.svg' alt='Drawing' style='height:100px;'/> |
| | | |
`pandas` is a Python package that provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. The two primary data structures of `pandas`, `Series` (1-dimensional) and `DataFrame` (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. `pandas` is built on top of the `numpy` package and is intended to integrate well with many other third-party libraries in a scientific computing environment.

To use `pandas`, start by importing the module, for example:

In [None]:
import pandas as pd
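
As a minimal sketch of the two data structures described above, the cell below builds a labeled `Series` and a small `DataFrame`. The member labels and force values are hypothetical and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical axial forces for three members, labeled by member name
s = pd.Series([120.0, 340.0, 95.0], index=['m1', 'm2', 'm3'], name='NEd')

# A DataFrame is a 2-dimensional table whose columns share the same index
df = pd.DataFrame({'NEd': [120.0, 340.0, 95.0],
                   'NRd': [200.0, 300.0, 150.0]},
                  index=['m1', 'm2', 'm3'])

# Selecting one column of a DataFrame returns a Series
col = df['NEd']

# The underlying values are available as a numpy array
values = col.to_numpy()

print(s.loc['m2'])
print(df.shape)
```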

<font color=red><div style="text-align: right"> **Documentation for**  

[**`pandas`**](http://pandas.pydata.org/pandas-docs/stable/) </div></font>

### <font color=blue>3.4.1. Reading tabular data files</font>
`pandas` offers a number of [IO functions](http://pandas.pydata.org/pandas-docs/version/0.20/io.html) for reading tabular data files into a `DataFrame`. Supported formats range from simple .csv or Excel files to more advanced storage formats (e.g. HDF5, Pickle, SQL).

Example:

In [None]:
# In this case, the .csv file contains a header line
# By default, the function assumes that the first line is a header
# If the file has no header, pass the optional argument 'header = None' so the first line is read as data
# Header labels can be provided manually with the optional argument 'names = [headerA, ..., headerN]'
# Forward slashes avoid the invalid '\d' escape sequence and also work on Windows
dc = pd.read_csv('tools/design_checks.csv')
dc
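
The `header` and `names` options mentioned in the comments above can be tried without the project file. The sketch below reads a small in-memory CSV through `io.StringIO`; the sample rows are hypothetical and do not reproduce the contents of `design_checks.csv`.

```python
import io
import pandas as pd

# Hypothetical sample data WITHOUT a header line
raw = "IPE100,0.45,PASS\nIPE200,1.10,FAIL\n"

# header=None: the first line is read as data, columns are numbered 0, 1, 2
no_header = pd.read_csv(io.StringIO(raw), header=None)

# names=[...]: provide the column labels manually
named = pd.read_csv(io.StringIO(raw), header=None,
                    names=['ID', 'Util_ratio', 'Check'])

print(no_header)
print(named)
```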

To inspect the first rows of the `DataFrame` (5 by default), use the `head()` method:

In [None]:
dc.head()

The `tail()` method does the same for the last rows (5 by default):

In [None]:
dc.tail()

<font color=red><div style="text-align: right"> **Documentation for**  
[**`pandas.read_csv`**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)  
[**`pandas.DataFrame.head`**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html)  
[**`pandas.DataFrame.tail`**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html) </div></font>

### <font color=blue>3.4.2. Quick statistics</font> 

In [None]:
# Calculate summary statistics
dc.describe()

In [None]:
# Using an optional parameter to the describe method to summarize only 'object' (string) columns
dc.describe(include = ['object'])

<font color=red><div style="text-align: right"> **Documentation for**  
[**`pandas.DataFrame.describe`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) </div></font>

### <font color=blue>3.4.3. Selecting a pandas Series from a DataFrame</font>  
Sometimes you need to select a single column (a `Series`) of the `DataFrame`, which behaves like a one-dimensional `numpy` array with axis labels. This can be done by passing the column label in bracket notation. Dot notation also works, but with some limitations (e.g. it fails if the `Series` name contains spaces, and it cannot be used to create a new `Series`).

Example:

In [None]:
dc['ID'].head()
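
The difference between bracket and dot notation can be sketched on a small, hypothetical `DataFrame` (the column names are illustrative only):

```python
import pandas as pd

# Hypothetical sample data; one column name contains a space
df = pd.DataFrame({'ID': ['IPE100', 'IPE200'],
                   'Util ratio': [0.45, 1.10]})

# Both notations return the same Series when the name is a valid identifier
same = df['ID'].equals(df.ID)

# A name containing a space can only be reached with bracket notation
ratios = df['Util ratio']

# A new column has to be created with bracket notation
df['Check'] = ['PASS', 'FAIL']

print(same)
print(list(df.columns))
```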

### <font color=blue>3.4.4. Exploring pandas non-numeric Series</font>   

In [None]:
# Read the design check dataset
dc = pd.read_csv('tools/design_checks.csv')
dc.head()

In [None]:
# Count the non-null values, unique values, and frequency of the most common value
dc['Check'].describe()

In [None]:
# Count how many times each value in the Series occurs
dc['Check'].value_counts()

In [None]:
# Display percentages instead of raw counts
dc['Check'].value_counts(normalize = True)

In [None]:
# Display the unique values in the Series
dc['Check'].unique()

In [None]:
# Count the number of unique values in the Series
dc['Check'].nunique()

In [None]:
# Compute a cross-tabulation of two Series
pd.crosstab(dc['ID'], dc['Check']).head()

<font color=red><div style="text-align: right"> **Documentation for**  
[**`pandas.Series.describe`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.describe.html)  
[**`pandas.Series.value_counts`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)  
[**`pandas.Series.unique`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html)  
[**`pandas.Series.nunique`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.nunique.html)  
[**`pandas.crosstab`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html) </div></font>

### <font color=blue>3.4.5. Adding a pandas Series to a DataFrame</font>

In [None]:
# One example is to create a cumulative NEd series within the dataframe, using the cumsum method of the series
dc['cumulative_NEd_[kN]'] = dc['NEd_[kN]'].cumsum()
dc.head()

<font color=red><div style="text-align: right"> **Documentation for**  
[**`pandas.Series.cumsum`**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.cumsum.html)</div></font>

### <font color=blue>3.4.6. Renaming DataFrame columns</font> 

In [None]:
# Examine the column names
dc = pd.read_csv('tools/design_checks.csv')
dc.columns

In [None]:
# Rename two of the columns by using the 'rename' method
dc.rename(columns = {'ID':'Section', 'Check':'Does it pass?'}, inplace = True)
dc.head()

In [None]:
# Replace all of the column names by overwriting the 'columns' attribute
dc_cols = ['ID', 'G [kg/m]', 'b [mm]', 't [mm]', 'A [cm2]', 'I [cm4]', 'i [cm]','NRd [kN]', 'NEd [kN]', 'Util ratio', 'Check']
dc.columns = dc_cols
dc.head()

In [None]:
# Replace the column names during the file reading process by using the 'names' parameter
dc_cols = ['ID', 'G [kg/m]', 'b [mm]', 't [mm]', 'A [cm2]', 'I [cm4]', 'i [cm]','NRd [kN]', 'NEd [kN]', 'Util ratio', 'Check']
dc = pd.read_csv('tools/design_checks.csv', header = 0, names = dc_cols)
dc.head()

In [None]:
# Replace all spaces with underscores in the column names by using the 'str.replace' method
dc.columns = dc.columns.str.replace(' ', '_')
dc.head()

<font color=red><div style="text-align: right"> **Documentation for**  
[**`pandas.DataFrame.rename`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html)  
[**`pandas.Series.str.replace`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html) </div></font>

### <font color=blue>3.4.7. Sorting a pandas Series or DataFrame</font> 
By default, the sorting methods in `pandas` return a new, sorted object and leave the original data unchanged. To keep the sorted result, assign it to a variable or pass `inplace = True`.

In [None]:
# Read the design check dataset
dc = pd.read_csv('tools/design_checks.csv')
dc.head()

In [None]:
# Sort the 'Util_ratio' Series in ascending order (returns a Series)
dc['Util_ratio'].sort_values().head()

In [None]:
# Sort in descending order instead
dc['Util_ratio'].sort_values(ascending = False).head()

In [None]:
# Sort the entire DataFrame by the 'A_[cm2]' Series (returns a DataFrame)
dc.sort_values('A_[cm2]').head()

In [None]:
# Sort in descending order instead
dc.sort_values('A_[cm2]', ascending = False).head()

In [None]:
# Sort the DataFrame first by ascending order of 'NRd_[kN]', then by 'Util_ratio'
dc.sort_values(['NRd_[kN]', 'Util_ratio']).head()
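
A minimal sketch on hypothetical data confirms that `sort_values` leaves the original object untouched unless the result is assigned back:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'ID': ['A', 'B', 'C'],
                   'Util_ratio': [0.9, 0.3, 0.6]})

# sort_values returns a new DataFrame; df itself keeps its order
sorted_df = df.sort_values('Util_ratio')
original_order = df['ID'].tolist()
sorted_order = sorted_df['ID'].tolist()

# To keep the sorted result, assign it back to the variable
df = df.sort_values('Util_ratio', ascending=False).reset_index(drop=True)

print(original_order)
print(sorted_order)
print(df['ID'].tolist())
```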

<font color=red><div style="text-align: right"> **Documentation for**  
[**`pandas.Series.sort_values`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_values.html)  
[**`pandas.DataFrame.sort_values`**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) </div></font>

### <font color=blue>3.4.8. Filtering rows of a pandas DataFrame by column values</font>  

In [None]:
# Read the design check dataset
dc = pd.read_csv('tools/design_checks.csv')
dc.head()

In [None]:
# Filter members with 'b_[mm]' above or equal to 200mm
dc[dc['b_[mm]'] >= 200].head()

In [None]:
# Select the 'Util_ratio' Series from the filtered DataFrame
dc[dc['b_[mm]'] >= 200]['Util_ratio'].head()

In [None]:
# Or equivalently, use the 'loc' indexer with the row filter and the column label in a single step
dc.loc[dc['b_[mm]'] >= 200, 'Util_ratio'].head()

In [None]:
# Filter members with 'b_[mm]' above or equal to 200mm and with 'Check' equal to 'PASS'
dc[(dc['b_[mm]'] >= 200) & (dc['Check'] == 'PASS')].head()

In [None]:
# Filter members with 'b_[mm]' above or equal to 200mm and with 'Check' equal to 'PASS' or 'FAIL'
dc[(dc['b_[mm]'] >= 200) & (dc['Check'].isin(['PASS', 'FAIL']))].head()
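
The filters above rely on a boolean `Series` (a mask) with one True/False entry per row. The sketch below, on hypothetical data, builds such a mask explicitly and passes it to `loc` together with a column label. Each condition is wrapped in parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'b_[mm]': [100, 200, 250, 300],
                   'Util_ratio': [0.4, 0.8, 1.2, 0.6],
                   'Check': ['PASS', 'PASS', 'FAIL', 'PASS']})

# Each comparison yields a boolean Series; '&' combines them element-wise
mask = (df['b_[mm]'] >= 200) & (df['Check'] == 'PASS')

# Rows where the mask is True, restricted to a single column
selected = df.loc[mask, 'Util_ratio']

print(mask.tolist())
print(selected.tolist())
```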

<font color=red><div style="text-align: right"> **Documentation for**  
[**`pandas.DataFrame.loc`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)  
[**`pandas.Series.isin`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isin.html) </div></font>

### <font color=blue>3.4.9. DataFrame axis</font>
When referring to rows or columns with the axis parameter:
- `axis = 0` (or `axis = 'index'`) refers to rows
- `axis = 1` (or `axis = 'columns'`) refers to columns

In [None]:
# Read the design check dataset
dc = pd.read_csv('tools/design_checks.csv')
dc.head()

In [None]:
# Remove a column (temporarily)
dc.drop('i_[cm]', axis = 1).head()

In [None]:
# Remove a column (permanently)
dc.drop('i_[cm]', axis = 1, inplace = True)
dc.head()

In [None]:
# Remove multiple columns at once
dc.drop(['Check', 'Util_ratio'], axis = 1, inplace = True)
dc.head()

In [None]:
# Remove a row (temporarily)
dc.drop(2, axis = 0).head()

In [None]:
# Remove a row (permanently)
dc.drop(2, axis = 0, inplace = True)
dc.head()

In [None]:
# Remove multiple rows at once
dc.drop([0, 1], axis = 0, inplace = True)
dc.head()

When performing a mathematical operation with the axis parameter:

- `axis = 0` means the operation should "move down" the row axis
- `axis = 1` means the operation should "move across" the column axis
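
Both uses of the `axis` parameter can be checked on a tiny, hypothetical `DataFrame` where the results are easy to verify by hand:

```python
import pandas as pd

# Hypothetical sample data with two numeric columns
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 moves down the rows: one result per column
col_means = df.mean(axis=0)

# axis=1 moves across the columns: one result per row
row_means = df.mean(axis=1)

# With drop, axis=1 means the label refers to a column
dropped = df.drop('b', axis=1)

print(col_means)
print(row_means)
print(list(dropped.columns))
```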

In [None]:
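# Calculate the mean of each numeric column
# 'numeric_only = True' skips text columns such as 'ID', which recent pandas versions refuse to average
dc.mean(axis = 0, numeric_only = True)

In [None]:
# Calculate the mean of each row (numeric columns only)
dc.mean(axis = 1, numeric_only = True).head()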

<font color=red><div style="text-align: right"> **Documentation for**  
[**`pandas.DataFrame.drop`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)  
[**`pandas.DataFrame.mean`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html) </div></font>

### <font color=blue>3.4.10. pandas groupby</font> 

In [None]:
import pandas as pd
# Read the design check dataset
dc = pd.read_csv('tools/design_checks.csv')
dc.head()

In [None]:
# Calculate the maximum 'Util_ratio' for members with 'b_[mm]' of 200mm
dc[dc['b_[mm]'] == 200]['Util_ratio'].max()

In [None]:
# Calculate the maximum 'Util_ratio' for each value of 'b_[mm]'
dc.groupby('b_[mm]')['Util_ratio'].max()

In [None]:
# Multiple aggregation functions can be applied simultaneously
dc.groupby('b_[mm]')['Util_ratio'].agg(['count', 'mean', 'min', 'max'])
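
As a small check of what `groupby` does, the sketch below uses hypothetical data to compare the filtered calculation for one group with the grouped result for all groups:

```python
import pandas as pd

# Hypothetical sample data: two section widths with several members each
df = pd.DataFrame({'b_[mm]': [100, 100, 200, 200, 200],
                   'Util_ratio': [0.2, 0.4, 0.5, 0.9, 0.7]})

# Manual approach: filter one group, then aggregate
manual_max = df[df['b_[mm]'] == 200]['Util_ratio'].max()

# groupby repeats that calculation for every distinct value of 'b_[mm]'
per_group_max = df.groupby('b_[mm]')['Util_ratio'].max()

# Several aggregations at once give one column per function
summary = df.groupby('b_[mm]')['Util_ratio'].agg(['count', 'max'])

print(manual_max)
print(per_group_max)
print(summary)
```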

<font color=red><div style="text-align: right"> **Documentation for**  
[**`pandas.DataFrame.groupby`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html)    
[**`pandas.core.groupby.DataFrameGroupBy.agg`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.agg.html) </div></font>

### <font color=blue>3.4.11. Finding and removing pandas duplicates</font>

In [None]:
# Read the design check dataset
dc = pd.read_csv('tools/design_checks.csv')
dc.head()

In [None]:
# Detect duplicate 'ID' codes: True if an item is identical to a previous item
dc['ID'].duplicated().head()

In [None]:
# Count the duplicate items (True becomes 1, False becomes 0)
dc['ID'].duplicated().sum()

In [None]:
# Detect duplicate DataFrame rows: True if an entire row is identical to a previous row
dc.duplicated().head()

In [None]:
# Count the duplicate rows
dc.duplicated().sum()

In [None]:
# Drop the duplicate rows (inplace = False by default)
# In this case, there are no duplicate rows
# Parameter 'keep' set to 'first', 'last' or False keeps the first, the last, or no occurrence, respectively
dc.drop_duplicates(keep = 'first').shape

In [None]:
# Only consider a subset of columns when identifying and dropping duplicates
# In this case, we know there are duplicate 'ID' values, so we can use that
dc.drop_duplicates(subset = ['ID'], keep = 'first').shape
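
The effect of the `keep` parameter is easiest to see on a few rows. The sketch below uses hypothetical data in which one 'ID' value appears twice:

```python
import pandas as pd

# Hypothetical sample data: 'IPE100' appears in rows 0 and 2
df = pd.DataFrame({'ID': ['IPE100', 'IPE200', 'IPE100'],
                   'Check': ['PASS', 'FAIL', 'FAIL']})

# True marks an 'ID' that already appeared in an earlier row
flags = df['ID'].duplicated()

# keep='first' retains the first occurrence, keep='last' the last one,
# and keep=False drops every row whose 'ID' is duplicated
first = df.drop_duplicates(subset=['ID'], keep='first')
last = df.drop_duplicates(subset=['ID'], keep='last')
neither = df.drop_duplicates(subset=['ID'], keep=False)

print(flags.tolist())
print(first.index.tolist(), last.index.tolist(), neither.index.tolist())
```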

<font color=red><div style="text-align: right"> **Documentation for**  
[**`pandas.DataFrame.drop_duplicates`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) </div></font>

### <font color=blue>3.4.12. Creating a pandas DataFrame from other objects</font>

In [None]:
# Create a DataFrame from a dictionary
# Keys become column names, values become data
pd.DataFrame({'id': ['IPE100', 'IPE200', 'IPE300'], 
              'check': ['PASS', 'FAIL', 'FAIL']})

In [None]:
# Optionally, you can specify the order of columns and define the index
pd.DataFrame({'id': ['IPE100', 'IPE200', 'IPE300'], 
              'check': ['PASS', 'FAIL', 'FAIL']},
             columns = ['check', 'id'],
             index = ['a', 'b', 'c'])

In [None]:
# Create a DataFrame from a list of lists
# Each inner list becomes a row
pd.DataFrame([['IPE100', 'PASS'], 
              ['IPE200', 'FAIL'], 
              ['IPE300', 'FAIL']], 
             columns = ['id', 'check'])

In [None]:
# Create a DataFrame of member IDs (101 through 200) and a binary value per member (random integer, 0 or 1)
import numpy as np
pd.DataFrame({'member': np.arange(101, 201, 1), 
              'util_ratio': np.random.randint(0, 2, 100)}).head()

In [None]:
# 'set_index' can be chained with the DataFrame constructor to select an index
import numpy as np
pd.DataFrame({'member': np.arange(101, 201, 1), 
              'util_ratio': np.random.randint(0, 2, 100)}).set_index('member').head()

<font color=red><div style="text-align: right"> **Documentation for**  
[**`pandas.DataFrame`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)  
[**`pandas.DataFrame.set_index`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html) </div></font>

[Back to top](#back_to_top)