# Week-11: Tutorial on Pandas (Continued)

<font size='4'>

This week, we will continue learning `pandas`.
Before delving into new lecture notes, I will revisit Quiz 3.
Comments on your final project Task 1 have been posted; please address them accordingly.

In [1]:
# 0.1
import os
import glob
import numpy as np
import pandas as pd
print(os.getcwd())

/Users/tma33/Library/CloudStorage/OneDrive-EmoryUniversity/Emory/Rollins SPH/2025/BIOS-584/python_proj


In [2]:
ptsd_dir = '{}/data/PTSD dataset.xlsx'.format(os.getcwd())
ptsd_df = pd.read_excel(ptsd_dir, sheet_name='main_dataset')
# print(ptsd_df.columns)

## 5. Cleaning data using pandas 

<font size='4'>

* Data cleaning is one of the most common and most important tasks in data science.
* Pandas allows you to preprocess data for multiple uses, including but not limited to training machine learning and deep learning models.
* Always check the missingness of the dataset first!
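A minimal sketch of a missingness check. The DataFrame `toy_df` and its column names are made up for illustration and are not part of the PTSD dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [34, np.nan, 51, 29],
    "score": [np.nan, np.nan, 40.0, 22.0],
})

# Number of missing values in each column.
missing_count = toy_df.isna().sum()

# Proportion of missing values in each column.
missing_prop = toy_df.isna().mean()

print(missing_count)
print(missing_prop)
```

Sorting `missing_prop` with `.sort_values(ascending=False)` is one way to find the columns with the most missingness quickly.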

In [3]:
# 5.0.1


### 5.1. Dropping missing values
<font size='4'>
    
* One way to deal with missing data is to simply drop it.
* This may be useful when you have plenty of data and losing a small portion won't impact the downstream analysis.
* You can use the `.dropna()` method.
* As an example, we apply this method to a copy of the original dataset.
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
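As a rough illustration of `.dropna()`, here is a sketch on a small made-up DataFrame (`toy_df` and its columns are hypothetical). It also shows the `subset` argument, which restricts the check to selected columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the PTSD dataset.
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [34, np.nan, 51, 29],
    "score": [np.nan, 35.0, 40.0, 22.0],
})

# Work on a copy so the original DataFrame stays untouched.
toy_df_2 = toy_df.copy()

# Default behavior: drop every row that has at least one missing value.
toy_df_2 = toy_df_2.dropna()

# Only drop rows where `age` is missing; NaN in other columns is kept.
toy_df_3 = toy_df.dropna(subset=["age"])

print(toy_df_2.shape)
print(toy_df_3.shape)
```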

In [4]:
# 5.1.1


<font size='4'>

* The resulting `ptsd_df_2` ends up with no rows because, by default, `dropna()` removes every row that contains at least one missing value.
* Let's look at the dataset with its first eight columns.

In [5]:
# 5.1.2


<font size='4'>

* The original sample size is reduced from 483 to 450.
* When a variable has less than **10%** missingness, it is usually acceptable to simply drop the rows where it is missing.

In [6]:
# 5.1.3


<font size='4'>
    
* The `axis` parameter lets you specify whether you are dropping rows or columns with missing values.
    * The default, `axis=0`, removes the rows containing `NaN`.
    * Use `axis=1` to remove the columns with one or more `NaN` values.
    * `inplace=True` modifies the DataFrame in place (the method then returns `None`), so you do not need to save the output of `.dropna()` into a new DataFrame.
* In this case, the column number is reduced from 439 to 71.

* Of course, we can also control how strictly rows or columns are dropped by setting the `how` parameter.
    * `any`: If any missing values are present, drop that row or column.
    * `all`: If all values are NA, drop that row or column.
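The interaction between `axis` and `how` can be sketched on a small made-up DataFrame. Here `toy_df` is hypothetical, and its `notes` column is entirely missing on purpose:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: `notes` is missing for every row.
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [34, np.nan, 51],
    "notes": [np.nan, np.nan, np.nan],
})

# Rows with ANY missing value are dropped (every row here, because of `notes`).
rows_any = toy_df.dropna(axis=0, how="any")

# Rows are dropped only if ALL of their values are missing (none here).
rows_all = toy_df.dropna(axis=0, how="all")

# Columns with ANY missing value are dropped (`age` and `notes`).
cols_any = toy_df.dropna(axis=1, how="any")

# Columns are dropped only if ALL of their values are missing (`notes`).
cols_all = toy_df.dropna(axis=1, how="all")

print(rows_any.shape, rows_all.shape)
print(list(cols_any.columns), list(cols_all.columns))
```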

In [7]:
# 5.1.4

# In this case, when we set `all`, 
# nothing is reduced because there is no completely missing row.

### 5.2. Replacing missing values

<font size='4'>

* When the missing percentage is moderate (>= 15% in my view), dropping values may lose information and introduce bias into your effect estimates.
* In that case, replacing missing values with other values is preferred.
* You can fill in the missing values with a summary statistic, e.g., the mean, or apply a statistical method to infer a value, e.g., multiple imputation.
* The relevant pandas method is `.fillna()`.
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
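A minimal sketch of mean imputation with `.fillna()`, assuming a made-up DataFrame `toy_df` with a hypothetical `score` column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing scores.
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "score": [10.0, np.nan, 30.0, np.nan],
})

# .mean() skips NaN by default, so this is the mean of the observed values.
score_mean = toy_df["score"].mean()

# Replace the missing values and assign the result back to the column.
toy_df["score"] = toy_df["score"].fillna(score_mean)

print(toy_df)
```

Assigning the result back to the column, rather than relying on `inplace=True` on a selected column, tends to be the safer pattern.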

In [8]:
# 5.2.1


In [9]:
# 5.2.2
# Assign the mean of pcl5week_score.completion to its 27 missing values.
# Write down your code.


### 5.3. Handling Duplicated Values

<font size='4'>

* Let's manually create some duplicates in the existing dataset.
* You can remove duplicated rows from the DataFrame using the `.drop_duplicates()` method. By default, it compares all columns and keeps the first occurrence of each duplicated row.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
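One possible way to create and then remove duplicates, sketched on a made-up DataFrame (`toy_df` and `toy_dup` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data without duplicates.
toy_df = pd.DataFrame({"id": [1, 2, 3], "group": ["a", "b", "a"]})

# Manually create duplicates by appending the first two rows again.
toy_dup = pd.concat([toy_df, toy_df.iloc[:2]], ignore_index=True)

# .duplicated() flags each row that repeats an earlier row.
n_dup = toy_dup.duplicated().sum()

# Keep the first occurrence of each row and drop the repeats.
toy_dedup = toy_dup.drop_duplicates()

print(n_dup)
print(toy_dedup)
```

The `subset` argument limits the comparison to selected columns, and `keep` controls which occurrence survives.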

In [10]:
# 5.3.1


In [11]:
# 5.3.2


### 5.4. Renaming Columns

<font size='4'>

* You can use the `.rename()` method to modify the column names.
* For example, we want to change `pcl5month_score.baseline` to `pcl5_score_baseline` and `pcl5week_score.completion` to `pcl5_score_completion` in `ptsd_df_3`.
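A sketch of this renaming on a made-up two-column DataFrame. The column names follow the text above, but `toy_df` and its values are hypothetical:

```python
import pandas as pd

# Hypothetical toy data that reuses the two column names from the text.
toy_df = pd.DataFrame({
    "pcl5month_score.baseline": [40, 55],
    "pcl5week_score.completion": [22, 31],
})

# Map old names to new names; columns not in the mapping are left unchanged.
toy_renamed = toy_df.rename(columns={
    "pcl5month_score.baseline": "pcl5_score_baseline",
    "pcl5week_score.completion": "pcl5_score_completion",
})

print(list(toy_renamed.columns))
```

`.rename()` returns a new DataFrame, so `toy_df` keeps its original column names unless the result is assigned back.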

In [12]:
# 5.4.1


<font size='4'>

* You can also directly assign the column names as a list to the DataFrame.
* Make sure the order of the names in the list matches the existing column order, and that the list has exactly one name per column.
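A small sketch of assigning column names as a list, with a length check added as a precaution. The DataFrame and the new names are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data with placeholder column names.
toy_df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# New names, listed in the same order as the existing columns.
new_names = ["id", "baseline", "completion"]

# Guard against a mismatch between the list and the number of columns.
if len(new_names) != toy_df.shape[1]:
    raise ValueError("new_names must have one entry per column")

toy_df.columns = new_names

print(list(toy_df.columns))
```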

In [13]:
# 5.4.2


<font size='4'>

* For more details, please read this checklist: https://www.datacamp.com/blog/infographic-data-cleaning-checklist

## 6. Data Analysis in Pandas

### 6.1. Summary Statistics (mean, median, and mode)

<font size='4'>
    
* `.mean()` for mean
* `.median()` for median
* `.mode()` for mode

* Similar to the `np.mean()` and `np.median()` functions, pandas provides three methods to compute the mean, median, and mode of a DataFrame.
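These three methods can be sketched on a made-up DataFrame (`toy_df` and its columns are hypothetical). Passing `numeric_only=True` skips the string column:

```python
import pandas as pd

# Hypothetical toy data with two numeric columns and one string column.
toy_df = pd.DataFrame({
    "score": [10, 20, 20, 50],
    "age": [30, 40, 50, 60],
    "site": ["a", "b", "a", "a"],
})

# Column-wise mean and median, restricted to numeric columns.
col_means = toy_df.mean(numeric_only=True)
col_medians = toy_df.median(numeric_only=True)

# .mode() returns a Series because there can be more than one mode.
score_mode = toy_df["score"].mode()

print(col_means)
print(col_medians)
print(score_mode)
```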

In [14]:
# 6.1.1

# However, this only applies to continuous or ordinal columns. You need to check your results carefully.

### 6.2. Create new columns based on existing columns

<font size='4'>

* Similar to R, pandas can easily create a new column using data from existing columns.
* For example, let's create a new column `pcl_5_mean` by taking the average of `pcl_5_score_intake`, `pcl_5_score_baseline`, and `pcl_5_score_completion`.
    * This value is not clinically meaningful.
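A minimal sketch of this row-wise average. The column names follow the text above, while `toy_df` and its values are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that reuses the three column names from the text.
toy_df = pd.DataFrame({
    "pcl_5_score_intake": [60, 45, 30],
    "pcl_5_score_baseline": [50, 45, np.nan],
    "pcl_5_score_completion": [40, 30, 20],
})

score_cols = [
    "pcl_5_score_intake",
    "pcl_5_score_baseline",
    "pcl_5_score_completion",
]

# axis=1 averages across columns within each row; NaN is skipped by default.
toy_df["pcl_5_mean"] = toy_df[score_cols].mean(axis=1)

print(toy_df["pcl_5_mean"])
```

Note that the third row is averaged over the two observed values only. Pass `skipna=False` if a row with any missing score should yield `NaN` instead.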

In [15]:
# 6.2.1


### 6.3. Counting using `.value_counts()`

<font size='4'>

* For categorical variables, we use the `.value_counts()` method.
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html
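A short sketch of `.value_counts()` on a made-up categorical column (`toy_df` and `site` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one missing category.
toy_df = pd.DataFrame({"site": ["a", "b", "a", "a", None]})

# Counts per category; missing values are excluded by default.
counts = toy_df["site"].value_counts()

# Include missing values as their own category.
counts_with_na = toy_df["site"].value_counts(dropna=False)

# Proportions instead of counts (among non-missing values).
props = toy_df["site"].value_counts(normalize=True)

print(counts)
print(counts_with_na)
print(props)
```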

In [16]:
# 6.3.1


### 6.4. Aggregating data with `.groupby()` in pandas

<font size='4'>

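* Pandas allows you to aggregate values by grouping them by specific column values using the `.groupby()` method.
* Pay attention to the order of your methods: group first, then select the columns of interest, then apply the aggregation.
* Use a list (`[]`) to group by, or select, multiple variables.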
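The group, select, aggregate pattern might look like the following on a made-up DataFrame (`toy_df` and all column names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({
    "site": ["a", "a", "b", "b"],
    "sex": ["F", "M", "F", "F"],
    "score": [10.0, 20.0, 30.0, 50.0],
    "age": [30, 40, 50, 60],
})

# Group by one variable, select one column, then aggregate.
mean_by_site = toy_df.groupby("site")["score"].mean()

# Group by two variables and aggregate two columns, using lists for both.
mean_by_site_sex = toy_df.groupby(["site", "sex"])[["score", "age"]].mean()

print(mean_by_site)
print(mean_by_site_sex)
```

Only the combinations of `site` and `sex` that occur in the data appear in the second result.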

In [17]:
# 6.4.1


In [18]:
# 6.4.2


### 6.5. Pivot tables

<font size='4'>

* Pandas enables you to calculate summary statistics as pivot tables.
* Use the `pandas.pivot_table()` function.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
* `index` is the row variable, `columns` is the column variable, and `values` is the outcome of interest to aggregate. The aggregation is set by `aggfunc`, which defaults to the mean.
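A sketch of `pandas.pivot_table()` on a made-up DataFrame (`toy_df` and its column names are hypothetical). It shows one outcome, then two outcomes at once:

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({
    "site": ["a", "a", "b", "b"],
    "sex": ["F", "M", "F", "F"],
    "score": [10.0, 20.0, 30.0, 50.0],
    "age": [30, 40, 50, 60],
})

# Mean score for each site (rows) by sex (columns).
table = pd.pivot_table(
    toy_df, index="site", columns="sex", values="score", aggfunc="mean"
)

# Several outcomes at once: the columns become (outcome, sex) pairs.
table_multi = pd.pivot_table(
    toy_df, index="site", columns="sex", values=["score", "age"], aggfunc="mean"
)

print(table)
print(table_multi)
```

Combinations that do not occur in the data, such as site `b` with sex `M` here, show up as `NaN` in the table.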

In [19]:
# 6.5.1


In [20]:
# 6.5.2
# You can examine multiple outcomes.


In [21]:
# 6.5.3
