<a href="https://colab.research.google.com/github/tmckim/materials-sp24-colab/blob/main/lab/lab10/lab10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Before you start - Save this notebook!

When you open a new Colab notebook from WebCampus (like you hopefully did for this one), you cannot save changes. So it's best to store the Colab notebook in your personal drive `"File > Save a copy in drive..."` **before** you do anything else.

The file will open in a new tab in your web browser, and it is automatically named something like: "**Copy of lab10.ipynb**". You can rename this to just the title of the assignment "**lab10.ipynb**". Make sure you do keep an informative name (like the name of the assignment) so that you know which files to submit back to WebCampus for grading! More instructions on this are at the end of the notebook.


**Where does the notebook get saved in Google Drive?**

By default, the notebook will be copied to a folder called “Colab Notebooks” at the root (home directory) of your Google Drive. If you use this for other courses or personal code notebooks, I recommend creating a folder for this course and then moving the assignments AFTER you have completed them. <br>

I also recommend you give the folder where you save your notebooks (described above) a different name than the folder we create below, which will store the notebook resources you need each time you work through a course notebook. These resources include any data files you will need, links to the images that appear in the notebook, and the files associated with the autograder for answer checking.<br>
You should select a name other than '**NS499-DataSci-course-materials**'. <br>
This folder gets overwritten with each assignment you work on in the course, so you should **NOT** store your notebooks in this folder that we use for course materials! <br><br>For example, you could create a folder called 'NS499-**notebooks**' or something along those lines.

__________

### Import and Setup Steps
If you restart colab, you must rerun all **5** steps in each of these cells!

In [None]:
# Step 1
# Setup and add files needed to access gdrive
from google.colab import drive                                   # these lines mount your gdrive to access the files we import below
drive.mount('/content/gdrive', force_remount=True)

In [None]:
# Step 2
# Change directory to the correct location in gdrive (modified way to do this from before)
import os
os.chdir('/content/gdrive/MyDrive/NS499-DataSci-course-materials/')

In [None]:
# Step 3
# Remove the files that were previously there- we will replace with all the old + new ones for this assignment
!rm -r materials-sp24-colab

In [None]:
# Step 4
# These lines clone (copy) all the files you will need from where I store the code+data for the course (github)
# Second part of the code copies the files to this location and folder in your own gdrive
!git clone https://github.com/tmckim/materials-sp24-colab '/content/gdrive/My Drive/NS499-DataSci-course-materials/materials-sp24-colab/'

In [None]:
# Step 5
# Change directory into the folder where the resources for this assignment are stored in gdrive (modified way from before)
os.chdir('/content/gdrive/MyDrive/NS499-DataSci-course-materials/materials-sp24-colab/lab/lab10/')

In [None]:
# Import packages and other things needed
# Don't change this cell; Just run this cell
# If you restart colab, make sure to run this cell again after the first ones above^

import pandas as pd
from datascience import *
import numpy as np
import seaborn as sns                 # this one is new!
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

## Conversion Notebook: From `datascience` library to Python's `pandas` library

Throughout this course, we have been working with the `datascience` library, a library created by faculty at UC Berkeley. While this library is not used outside of teaching in these courses, all of the ideas and concepts behind the library and the different functions are definitely used when dealing with data science problems in the real world. <br><br>
One of the common libraries used in industry is called `pandas`, and is a way to structure and analyze rectangular/tabular data. Using the `datascience` library in this course is a solid stepping stone to understanding `pandas` better. Throughout this notebook, we will go over certain concepts that we saw in the `datascience` library and show the equivalent functions that we will use in `pandas`. <br>The syntax and function names may be different but the underlying concepts are still the same!

Above, we `import pandas as pd`, which means that any function associated with pandas should be called using `pd.function_name()`. This tells Python that we want to use the specific `function_name` from the Pandas library. In theory, we could do `import pandas as pandas` or any other name, but it is known and commonly used to import it as `pd`. We will see some examples of this later in this notebook.

For reference:

Datascience documentation: http://data8.org/datascience/index.html

Python Reference: http://data8.org/python-reference/python-reference.html

Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/dsintro.html

## **Learning objectives:**


*   Work with the package `pandas` 🐼
*   Compare and contrast similarities and differences between `datascience` and `pandas` ✅ ❌
*   Apply skills learned from one package to another 💻
*   Find documentation and resources when questions arise ❓
*   Practice skills from this course so far with a published neuroscience dataset 🧠



---

### Tables and DataFrames

In `datascience`, we have something called a Table, which is a way to organize your data in a tabular format, which makes accessing rows and columns of data easier. <br>
In `pandas`, this structure is called a DataFrame. A DataFrame is the primary data structure in `pandas`. <br>Similar to a Table, we can access different rows and columns of a DataFrame. Tables are essentially the same as DataFrames: we can do similar actions and functions with both.

In the following lines of code, we create a Table and DataFrame of data by importing and reading an external [csv file](https://en.wikipedia.org/wiki/Comma-separated_values). The Table will be called `dementia_table` and the DataFrame will be called `dementia_df`.

---
### About the dataset

We will use a published neuroscience dataset. Citation for the data is: <br>
Marcus, D. S., Wang, T. H., Parker, J., Csernansky, J. G., Morris, J. C., & Buckner, R. L. (2007). Open Access Series of Imaging Studies (OASIS): Cross-sectional MRI Data in Young, Middle Aged, Nondemented, and Demented Older Adults. Journal of Cognitive Neuroscience, 19(9), 1498–1507. https://doi.org/10.1162/jocn.2007.19.9.1498 <br>

You can also download the paper directly from [here](https://drive.google.com/uc?export=view&id=1AvvCH-CCJFKgntMEj00W2PwpLEIXBxL9) if you'd like to review it (not necessary though).

This dataset consists of a cross-sectional collection of 416 subjects aged 18 to 96. The subjects are all right-handed and include both men and women. 100 of the included subjects over the age of 60 have been clinically diagnosed with very mild to moderate Alzheimer’s disease (AD). Additionally, a reliability data set is included containing 20 nondemented subjects imaged on a subsequent visit within 90 days of their initial session. The data is also available to download via [Kaggle](https://www.kaggle.com/datasets/jboysen/mri-and-alzheimers)

Here is an image from the paper that explains some of the variables in the dataset: <br>

![](DataVariableTable.png)


In addition, there are other variables including: <br>
`ID`: individual unique identifier <br>
`Hand`: indicates dominant hand. Should all be 'R' for right-handed <br>
`Delay`: time between sessions (number of days) for the subset of subjects that returned for a reliability session

Clinical Dementia Rating (CDR) classification was as follows: <br>


*   0 = Normal
*   0.5 = Very Mild Dementia
*   1 = Mild Dementia
*   2 = Moderate Dementia





In [None]:
# datascience
dementia_table = Table.read_table('oasis_cross-sectional.csv') # read the csv file into the notebook

In [None]:
# Review the data in the table
dementia_table

In [None]:
# pandas
dementia_df = pd.read_csv('oasis_cross-sectional.csv') # read the csv file into the notebook

In [None]:
# Review the data in the dataframe
dementia_df

In [None]:
# Review the type to show dataframe from pandas package
type(dementia_df)

We see here how to import an external csv file into our current notebook using both libraries. With `pandas` `read_csv()` function, we use `pd.read_csv()` to let Python know we want to use this particular `pandas` function.
Notice how there is an extra column of numbers in the DataFrame that results. These numbers are called the index of a DataFrame, and will be useful later in order to select different rows that we are interested in.

In `pandas`, we have two types of objects: DataFrames and Series. Series are similar to a DataFrame except there is only one column in a Series. You can think of Series as a single column of a DataFrame. Later in this notebook, we will use Series as well as DataFrames. Series come up when we select one column of a DataFrame, apply different functions to a column, etc.

You can learn more about DataFrames and Series with this [documentation](https://pandas.pydata.org/pandas-docs/stable/dsintro.html) as well.
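As a minimal sketch of the difference between the two objects, the example below builds a small made-up DataFrame (the name `toy_df` and its values are illustrative assumptions, not the course data) and pulls out one column in two ways.

```python
import pandas as pd

# Hypothetical toy data, not the OASIS dataset
toy_df = pd.DataFrame({"Age": [74, 55, 73], "M/F": ["F", "F", "M"]})

one_column_series = toy_df["Age"]     # single brackets return a Series
one_column_frame = toy_df[["Age"]]    # double brackets return a one-column DataFrame

print(type(one_column_series).__name__)   # Series
print(type(one_column_frame).__name__)    # DataFrame
print(one_column_series.shape, one_column_frame.shape)
```

The Series has a single dimension (just its length), while the DataFrame keeps two (rows and columns), even when it holds only one column.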

In [None]:
# Another thing to do when you first look at your new data, is to see what info is there
# This is possible with .info() after the name of your data frame like so:
print(dementia_df.info())

We have 436 rows (observations) and 12 columns. In the column labeled `Column`, the names of each column are listed and it shows how many non-null values are present, meaning how much data there is. Any difference between this number and the total number of entries (436 shown at the top) indicates there is missing data. The last column `Dtype` tells us what type of data you have in each column. You can see here that many are numbers and have no decimals (`int`) or have decimal values (`float64`). Note that in Pandas, strings are labeled as `object` (first three entries).

We can see that we do have missing data here, particularly in the `Delay` column, which has only 20 values. So we are missing 416 (436 - 20) values. If you go back up and review the table, you will see that missing values are labeled with `NaN`, which stands for `Not a Number`.
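If you would rather count missing values directly than read them off `.info()`, one possible approach is sketched below. It uses a small made-up DataFrame (`toy_df` and its numbers are assumptions for illustration), so the counts will not match the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; np.nan marks a missing value
toy_df = pd.DataFrame({
    "Age": [74, 55, 73, 28],
    "Delay": [np.nan, np.nan, 11.0, np.nan],
})

missing_per_column = toy_df.isna().sum()   # number of NaN entries in each column
print(missing_per_column)

# Same idea as "total entries minus non-null values" from .info()
print(len(toy_df) - toy_df["Delay"].count())
```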

---
### Make a Table/DataFrame
If we want to create a new Table or DataFrame from scratch, we can call the respective functions and then specify the columns and the data within each column (where the data is in array/dict format). Some examples from both `datascience` and `pandas` are shown below.

In [None]:
# datascience
# create Table with 2 columns and 3 rows
flowers = Table().with_columns(
    'Number of petals', make_array(8, 34, 5),
    'Name', make_array('lotus', 'sunflower', 'rose')
)
flowers

In [None]:
# pandas
# create DataFrame with 2 columns and 3 rows
flowers_df = pd.DataFrame(data = {'Number of petals': [8, 34, 5], 'Name': ['lotus', 'sunflower', 'rose']})
flowers_df

In [None]:
# Note how we have added the data using a new python data type
new_type = {'Number of petals': [8, 34, 5], 'Name': ['lotus', 'sunflower', 'rose']}
type(new_type)

---
### Dictionaries

Here we have a new type of data structure that we haven't covered in this class. This is called a dictionary. Dictionaries store data in key:value pairs. Also note that we have a new type of brackets or braces that are curly `{ }`

<br>

In the example above, the key: value pairs are: <br>
`{'Number of petals': [8,34,5],` <br>
`'Name': ['lotus','sunflower', 'rose']}` <br>
The keys are the strings for the column names, and the values are the data that go in the rows of the table. <br><br>

Dictionaries are structures which can contain multiple data types. For each unique key, the dictionary has one value. Keys can be various data types: strings, numbers, or tuples, while the corresponding values can be any Python object.<br>

You **cannot** access values of the dictionary by the indexes (like you can in lists or arrays). But you can access them by the key. Due to this feature dictionaries don't allow duplicated keys.

You can also access just the keys or just the values with the `.keys()` and `.values()` methods.
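The points above can be sketched with a small made-up dictionary (the name `record` and its contents are illustrative only). It shows a lookup by key, what happens if you try a position instead, and how assigning to an existing key replaces its value rather than creating a duplicate.

```python
# Hypothetical participant record, for illustration only
record = {"name": "Jane Roe", "group": "Treatment", "age": 37}

print(record["age"])            # look up a value by its key

try:
    record[0]                   # positions do not work: 0 is not one of the keys
except KeyError as err:
    print("KeyError:", err)

record["age"] = 38              # assigning to an existing key overwrites its value
print(list(record.keys()))      # the keys, in insertion order
print(list(record.values()))    # the matching values
```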



In [None]:
# Create a dictionary called participant
participant = {'name': 'Jon Doe', 'group': 'Control', 'age': 42}
print(participant['name'])

In [None]:
# How would you reference the other keys like in the print statement above? try group or age

In [None]:
# add new key-value pair to the dictionary
participant['ID'] = 'CJD'
print(participant)

In [None]:
# method to access keys
participant.keys()

In [None]:
# method to access values
participant.values()


 To find out more, read the Python [documentation](https://docs.python.org/3/tutorial/datastructures.html#dictionaries). <br>

Other examples and [tutorial](https://www.w3schools.com/python/python_dictionaries.asp)

---
### Select Columns
In both `datascience` and `pandas` we can select a column or multiple columns, based on what information we want from the data. In `datascience` we use the `select()` function, while in `pandas` we use indexing with brackets `[]`.

In [None]:
# datascience
# select one column
dementia_table.select('Age')

In [None]:
# datascience
# select multiple columns
dementia_table.select('Age','M/F')

In [None]:
# pandas
# select one column - return Series
dementia_df["Age"]

In [None]:
# pandas
# select one column - return DataFrame
dementia_df[["Age"]]

In [None]:
# pandas
# select a specific column
dementia_df.Age

In [None]:
# pandas
# select multiple columns
dementia_df[["Age", "M/F"]]

If you want to view all columns in your dataframe in `pandas` you can also use the following method.

In [None]:
# pandas
dementia_df.columns

We can also use `loc` and `iloc` to select specific columns, which we will introduce in the next section.

---
### Select Rows
In `datascience`, we use `take` in order to select certain rows, based on what row numbers we want to select (0 indexed). With selecting multiple rows, we use the concept of list slicing in order to select a sequence of rows, or even select multiple rows (if the numbers are in an array).

In [None]:
# datascience
dementia_table.take(2) # select the row with an index of 2

In [None]:
dementia_table.take(np.arange(1, 3)) # select the rows with index from 1 to 3 - remember, the last # will NOT be included

In [None]:
dementia_table.take[0, 3, 4] # select the rows with index 0, 3, 4

In [None]:
# pandas
dementia_df[0:3] # select the rows with index from 0 to 3 (not including 3)

What we did above is called *slicing*. The input has a format similar to what we learned to put into `np.arange`. The structure looks like this: <br>
`[start_index : end_index : step]` <br>

Note that here we use a colon `:` instead of a comma `,` to separate the input values.

In our example, note that slicing only selects the values up to the `end_index`, but does not include that value. That is why we got the rows with index 0, 1, and 2, but not the row with index 3. Remember we index starting from 0. <br>
Also note that we did not enter a third argument/input to slice. We omitted `step`, which has a default value of 1 and does not need to be entered if that is the option you'd like to use.
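The same `[start_index : end_index : step]` pattern works on plain Python lists, which makes it easy to experiment with. The list below is made up purely to illustrate the syntax.

```python
# A plain Python list used only to illustrate slice syntax
values = [10, 20, 30, 40, 50, 60]

first_three = values[0:3]       # indexes 0, 1, 2 (index 3 is excluded)
every_other = values[0:6:2]     # step of 2: indexes 0, 2, 4
from_index_two = values[2:]     # omitting end_index runs to the end

print(first_three)      # [10, 20, 30]
print(every_other)      # [10, 30, 50]
print(from_index_two)   # [30, 40, 50, 60]
```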

---
#### loc and iloc

With `pandas`, we can use `loc` and `iloc` to select certain rows or columns from a DataFrame - all at once - making it a powerful tool. <br>
`loc` (stands for `loc`ation) gets rows or columns with particular labels from the index. <br>
`iloc` (stands for `i`nteger `loc`ation) gets rows or columns at particular positions in the index (so it only takes integers). <br><br>
With both of these, we index using brackets and specify which rows and columns we want to select based on `[row, column]`. <br>
Note: if we have just `:` for either row or column, this means that we select all of it (based on if it is in the row or column section of the brackets). An example is shown below.
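The difference between labels and positions is easiest to see when the index is not just 0, 1, 2. The sketch below assumes a small made-up DataFrame whose index uses the letters `a`, `b`, `c` (all names and values here are illustrative).

```python
import pandas as pd

# Hypothetical toy data with text labels as the index
toy_df = pd.DataFrame(
    {"Age": [74, 55, 73], "Educ": [2, 4, 4]},
    index=["a", "b", "c"],
)

by_label = toy_df.loc["b", "Age"]       # row labeled "b", column labeled "Age"
by_position = toy_df.iloc[1, 0]         # second row, first column (the same cell)

label_slice = toy_df.loc["a":"b", :]    # label slices include the end label
position_slice = toy_df.iloc[0:2, :]    # position slices exclude the end position

print(by_label, by_position)
print(label_slice.equals(position_slice))
```

One detail worth noticing: slicing with `loc` includes the end label, while slicing with `iloc` follows the usual Python rule and stops before the end position.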

Here is a helpful image from this [website](https://www.shanelynn.ie/pandas-iloc-loc-select-rows-and-columns-dataframe/)

![](pd_iloc.png)


In [None]:
dementia_df.loc[:,'Age'] # select all rows but only the column Age

In [None]:
dementia_df.loc[:,['Age', 'M/F']] # select all rows but only the Age and M/F columns

In [None]:
dementia_df.iloc[1:3,0:3] # select the rows with index from 1 to 3 (not including 3) and columns in the positions of 0 to 3 (not including 3)

---
### Rename Columns

In the `datascience` package, we can rename a column or multiple columns. We may want to do this when we update a column to make the title more specific. We also need to make sure that the new column name is not the same as any other existing column, because two columns with the same name will cause an error. We do this with the `relabeled()` function.

In [None]:
dementia_table.relabeled('Educ', 'Education') # rename the Educ column as Education

In [None]:
dementia_table.relabeled(['Educ', 'SES'], ['Education', 'Socioeconomic Status']) # rename the Educ column as Education and SES as Socioeconomic Status

In `pandas` we can use a similar function called `rename()`.
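One thing to keep in mind is that `rename()` returns a new DataFrame by default and leaves the original alone. A minimal sketch on made-up data (`toy_df` and `renamed` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data, not the OASIS dataset
toy_df = pd.DataFrame({"Educ": [2, 4], "SES": [3.0, 1.0]})

renamed = toy_df.rename(columns={"Educ": "Education"})   # returns a new DataFrame

print(list(toy_df.columns))     # the original column names are unchanged
print(list(renamed.columns))    # the new DataFrame has the updated name
```

To keep the new names, assign the result to a variable, as done with `renamed` here.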

In [None]:
dementia_df.rename(columns = {'Age': 'New Age'})  # rename the Age column as New Age

In [None]:
dementia_df.rename(columns={'SES': 'Socioeconomic Status', 'Educ': 'Education'}) # rename SES as Socioeconomic Status and Educ as Education

---
### Where/Filtering

In `datascience` there is a function called `where` which creates a copy of a table with only the rows that match some condition. To filter out a DataFrame by its contents in `pandas`, we need to use boolean (`True` or `False`) expressions in order to select the rows we want to keep.

In [None]:
# datascience
dementia_table.where("Age", are.above(75)) # select/filter rows so the Age is above 75

In `pandas`, we want to keep the rows that have an `Age` greater than 75. We can first create a boolean array, which will assign a `True` or `False` value to each row based on whether it satisfies the condition we give it. We can then use this array to index our original DataFrame, and only the rows that are `True` will be returned in a new DataFrame.


In [None]:
# pandas
boolean_array = dementia_df["Age"] > 75 # create a boolean array saying if the Age is greater than 75
boolean_array

In [None]:
dementia_df[boolean_array] # apply boolean array to DataFrame and filter the rows

We can also do this all in one step as shown below.


In [None]:
# All in one step
dementia_df[dementia_df["Age"] > 75] # filter the DataFrame based on Age greater than 75

In [None]:
# Note the above created a new dataframe and did not replace our original- all data is still here
dementia_df

Sometimes we want to filter by multiple different conditions. In `pandas` we can do this using parentheses and a symbol indicating if we want both conditions to be satisfied (and) or at least one to be satisfied (or). The format is:

`df[(condition1) & (condition2)]`

or

`df[(condition1) | (condition2)]`

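Note: `&` is different from `and`, and `|` is different from `or`. The symbols `&` and `|` compare two boolean arrays row by row, while the keywords `and` and `or` expect a single `True` or `False` on each side.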
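To see why the symbols are needed, here is a small sketch on made-up data (`toy_df` is an illustrative name). The `&` and `|` versions return one `True`/`False` per row, whereas the `and` version raises an error because pandas cannot reduce a whole column to a single `True` or `False`.

```python
import pandas as pd

# Hypothetical toy data, not the OASIS dataset
toy_df = pd.DataFrame({"Age": [25, 40, 80]})

both = (toy_df["Age"] > 30) & (toy_df["Age"] < 75)      # True only where both hold
either = (toy_df["Age"] < 30) | (toy_df["Age"] > 75)    # True where at least one holds

print(both.tolist())     # [False, True, False]
print(either.tolist())   # [True, False, True]

try:
    (toy_df["Age"] > 30) and (toy_df["Age"] < 75)       # `and` needs a single True/False
except ValueError as err:
    print("ValueError:", err)
```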

In [None]:
dementia_df[(dementia_df["Age"] > 30) & (dementia_df["Age"] < 75)] # filter the DataFrame based on Age greater than 30 and less than 75

In [2]:
#@title Task
from IPython.display import HTML

alert_info = '''
<div style="font-size: 20px;" class="alert alert-info" role="alert">
  <h4 class="alert-heading">Task</h4>
You run an experiment with two groups: control (`control`) and treatment (`treatment`). You want to filter out some participants from the treatment group who don’t meet the minimum BMI criteria (BMI should be equal to or greater than 15). <br>
Does this participant meet this criterion?
</div>
'''

display(HTML('<link href="https://nbviewer.org/static/build/styles.css" rel="stylesheet">'))
display(HTML(alert_info))

In [None]:
# Replace the code where you see ...
age = 30
group = 'control'
BMI = 20

condition = (group ... "treatment") ... (... >= 15)
print(condition)

In [3]:
#@title Task
from IPython.display import HTML

alert_info = '''
<div style="font-size: 20px;" class="alert alert-info" role="alert">
  <h4 class="alert-heading">Task</h4>
Now you want to be more sophisticated (for whatever reason). You update your criteria for the treatment group. You want to keep the participant if they are older than 40 or their BMI is equal to or greater than 15.<br>
Does this participant fit the updated conditions?
</div>
'''

display(HTML('<link href="https://nbviewer.org/static/build/styles.css" rel="stylesheet">'))
display(HTML(alert_info))

In [None]:
# Replace the code where you see ...
age = 30
group = 'control'
BMI = 20

condition = ((BMI ... 15) ... (age ... 40)) ... (group ... "treatment")
print(condition)

---
### Sort, Group, Pivot
#### Sort
We can sort values similarly in Tables and DataFrames - using a function and what column we want to sort by.

In [None]:
# datascience
dementia_table.sort('Age') # sort the table by Age column

In [None]:
# pandas
dementia_df.sort_values('Age') # sort DataFrame by Age column

In [None]:
# pandas
dementia_df.sort_values('Age', ascending=False) # sort DataFrame by Age column in descending order

We can also specify whether to sort in ascending or descending order. In `datascience` we do this with the `descending` parameter, setting it equal to `True` or `False`. <br> In `pandas` we use the `ascending` parameter and set it equal to `True` or `False`. By default, the column sorts in ascending order.
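A quick way to check the `pandas` behavior is to sort a tiny made-up DataFrame both ways (`toy_df` and its values are illustrative):

```python
import pandas as pd

# Hypothetical toy data, not the OASIS dataset
toy_df = pd.DataFrame({"Age": [74, 28, 55]})

youngest_first = toy_df.sort_values("Age")                   # ascending=True is the default
oldest_first = toy_df.sort_values("Age", ascending=False)    # flip the order

print(youngest_first["Age"].tolist())   # [28, 55, 74]
print(oldest_first["Age"].tolist())     # [74, 55, 28]
```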

#### Group
In both `datascience` and `pandas` we can use group functions that allow us to group records of our data into buckets. You can think of grouping as splitting the dataset into buckets. Then you can call "aggregate" functions (`mean`, `sum`, `max`, `min`, etc) on these buckets to find these values per bucket (which can lead to interesting analysis)!

Let's say that we want to group by sex (`M/F`) and see what the average age is in each group.

In [None]:
# datascience
dementia_table.select(['Age', 'M/F']).group('M/F', collect=np.average) # select the Age and M/F columns and then group by M/F and find the average Age per group

In [None]:
# pandas
dementia_df[['Age', 'M/F']].groupby('M/F').mean() # select the Age and M/F columns and then group by M/F and find the average Age per group

In this example, we want to see the **minimum** estimated total intracranial volume (eTIV) and **average** normalized whole-brain volume (nWBV) for each gender and clinical dementia rating (CDR). The following line of code might seem complicated, but run it to see the output and then we will break down the steps afterward.


In [None]:
dementia_df.groupby(["CDR", "M/F"]).agg({"eTIV": "min", "nWBV": "mean"})

1. `groupby` a list of column names. We only used one column above, but you can add more than one like we did here.
2. After we use `groupby`, we apply the aggregation method `.agg()` and specify a dictionary in the following way: `{column_name: aggregation function}`. We applied multiple functions at once on different columns.

**Notice**: If you wanted to apply multiple functions on the same column, you could specify a list, for example, `{"eTIV": ['min', 'max']}`. Try it out!
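As a sketch of what the list form produces, the example below applies two functions to one column of a small made-up DataFrame (the `eTIV` numbers are invented for illustration). The result has two levels of column names: the original column on top and the function name underneath.

```python
import pandas as pd

# Hypothetical toy data; the eTIV numbers are made up
toy_df = pd.DataFrame({
    "M/F": ["F", "F", "M", "M"],
    "eTIV": [1300, 1500, 1600, 1800],
})

summary = toy_df.groupby("M/F").agg({"eTIV": ["min", "max"]})
print(summary)

# Each column is identified by a (column, function) pair
print(summary.loc["F", ("eTIV", "max")])
```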

#### Pivot
We can create pivot tables in both `datascience` and `pandas` using different functions and specifying columns, index, values, and the collect/aggregate function acting on the values.

In [None]:
# datascience
dem_pivot = dementia_table.pivot(columns = 'M/F', rows = 'Age', values = 'MMSE', collect=np.mean) # create a pivot table with Age as rows and M/F as columns, and average the MMSE scores for corresponding entries
dem_pivot_sorted = dem_pivot.sort('Age', descending = True) # sort by Age (oldest first) because most values are filled in for higher Ages; younger participants have lots of NaNs
dem_pivot_sorted.show()

In [None]:
# pandas
dementia_df.pivot_table(values='MMSE', index=['Age'], columns=['M/F'], aggfunc=np.average) # create a pivot table with Age as rows and M/F as columns, and average the MMSE scores for corresponding entries

Note: We have `NaN` as values in our above table because `pandas` cannot find appropriate values for those specific combinations of rows and columns. If we want to replace these values with 0, we can use `fillna(0)` on the resulting pivot table.
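Here is a minimal sketch of that idea on made-up data (`toy_df`, `toy_pivot`, and the scores are illustrative assumptions). There is no 30-year-old male in the toy data, so that cell of the pivot table is `NaN` until we fill it.

```python
import pandas as pd

# Hypothetical toy data: no 30-year-old male, so that combination has no value
toy_df = pd.DataFrame({
    "Age": [30, 30, 80, 80],
    "M/F": ["F", "F", "F", "M"],
    "MMSE": [30, 28, 26, 22],
})

toy_pivot = toy_df.pivot_table(values="MMSE", index="Age", columns="M/F", aggfunc="mean")
filled = toy_pivot.fillna(0)    # replace every NaN with 0

print(toy_pivot)
print(filled)
```

Be careful when filling with 0: in a column of test scores, a 0 can look like a real measurement, so it is often safer to leave `NaN` in place unless you have a reason to replace it.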

Don't be alarmed by these `NaN` values! Real data is often not cleaned in advance, and null values are very common. They may even matter for your exploration of the data: how many null values you have and where they occur could be important!

---
### Visualizing Data


 **Seaborn** is a plotting package that works with matplotlib to more easily adjust the aesthetics of plots.


Read the [introduction](https://seaborn.pydata.org/tutorial/introduction.html) to the package, up to and including the **Statistical estimation section**. After reading, return to the notebook and continue exploring this package with the guided prompts below.

We already imported it at the beginning as: <br>
`import seaborn as sns` <br>
We use `sns` as the abbreviation because just like `numpy` is `np`, this is commonly used.<br> <br>

#### Barplots
Let's start by working with barplots. <br>
Seaborn documentation:
https://seaborn.pydata.org/generated/seaborn.barplot.html

You only need a few options at minimum to create a barplot.<br>


1.   The data to use: `data = dataframe_name`
2.   x values based on a column name from the dataframe: `"CDR"`
3.   y values based on a column name from the dataframe: `"Age"`



In [None]:
# Let's plot CDR (x-axis) by Age (y-axis)
sns.barplot(
    data=dementia_df,          # start by identifying our dataframe
    x="CDR",                   # column name from our dataset to use for x-values
    y="Age");                  # column name from our dataset to use for y-values

You can see that it automatically adds a few things for you. This includes labels for your axes based on the column names that were used. It also adds error bars based on the 95% confidence interval that it computes from the data (you don't see this part; it just does it). It chooses a default color palette here, so we have some interesting (bright!) color choices.

In [None]:
# Let's adjust our plot and add a few things - just run this cell
sns.barplot(
    data=dementia_df,          # start by identifying our dataframe
    x="CDR",                   # column name from our dataset to use for x-values
    y="Age",                   # column name from our dataset to use for y-values
    errorbar="sd",             # always a good idea to add error bars to our mean values
    color="royalblue")         # specify a color- many options!
plt.title("Age (Mean ± SD)");  # this is from matplotlib and adds text for a title


In [None]:
# Similar plot, but:
# add Biological Sex (M/F) as another grouping factor- *hue* option below

sns.barplot(                   # this is all part of seaborn
    data=dementia_df,          # start by identifying our dataframe
    x="CDR",                   # column name from our dataset to use for x-values
    y="Age",                   # column name from our dataset to use for y-values
    hue="M/F",                 # this is optional- we can color code by another variable
    errorbar="sd",             # always a good idea to add error bars to our mean values
    color="lightblue")         # lots of options for colors
plt.title("CDR Score By Age (Mean ± SD)");  # this is from matplotlib and adds text for a title
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0);  # you don't need to know this, but it puts the legend outside the box so it doesn't overlap with data


In [4]:
#@title Task
from IPython.display import HTML

alert_info = '''
<div style="font-size: 20px;" class="alert alert-info" role="alert">
  <h4 class="alert-heading">Task</h4>
Adjust the code for the plotting below based on what was given above. Make sure to pay attention to which variables you are plotting. Each one is different!
</div>
'''

display(HTML('<link href="https://nbviewer.org/static/build/styles.css" rel="stylesheet">'))
display(HTML(alert_info))

In [None]:
# Now use the code above, but plot CDR (x-axis) by Years of Education (y-axis)
# Keep Biological Sex (M/F) as another grouping factor- include *hue* option

sns.barplot(                   # this is all part of seaborn
...                            # add several lines of code like above here
)
plt.title();                   # adjust your title too!
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0);  # you don't need to know this, but it puts the legend outside the box so it doesn't overlap with data

In [None]:
# Now plot CDR (x-axis) by Socioeconomic Status (y-axis)
# Keep Biological Sex (M/F) as another grouping factor- include *hue* option

sns.barplot(                   # this is all part of seaborn
...                            # add several lines of code like above here
)
plt.title();                   # adjust your title too!
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0);  # you don't need to know this, but it puts the legend outside the box so it doesn't overlap with data

In [None]:
# Now plot CDR (x-axis) by Mini-Mental State Exam Score (y-axis)
# Keep Biological Sex (M/F) as another grouping factor- include *hue* option

sns.barplot(                   # this is all part of seaborn
...                            # add several lines of code like above here
)
plt.title();                   # adjust your title too!
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0);  # you don't need to know this, but it puts the legend outside the box so it doesn't overlap with data

Here's an example of how you could automate the creation of each plot from above and make them 'subplots' of your figure using a `for` loop.

In [None]:
columns_to_plot = ["Age", "Educ", "SES", "MMSE"]

plt.figure(figsize=(10,7), facecolor="white")

for (i, colname) in enumerate(columns_to_plot):
    plt.subplot(2,2,i+1)
    sns.barplot(
        data=dementia_df, x="CDR", y=colname,
        hue="M/F", errorbar="sd", color="lightblue")
    plt.title(f"{colname} (Mean ± SD)")
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0);  # you don't need to know this, but it puts the legend outside the box so it doesn't overlap with data

plt.tight_layout()
plt.show()

Here we could also use `groupby` to help us grab the actual numbers that appear in the plot. This is a good idea to check that things match when you are working with your code. The values should be what you expect based on the data.

In [None]:
# Calculate summary statistics to review and compare to plot
print("Summary statistics:")

# get the numerical values
summary_stats = dementia_df.groupby(by=["CDR", "M/F"]).agg(
    {"ID": "count", "Age": ["mean", "std"], "Educ": ["mean", "std"],
     "SES": ["mean", "std"], "MMSE": ["mean", "std"]}).round(2)

display(summary_stats)

In [6]:
#@title Task
from IPython.display import HTML

alert_info = '''
<div style="font-size: 20px;" class="alert alert-info" role="alert">
  <h4 class="alert-heading">Task</h4>
Answer the following question: <br>
Which of the following statements is correct based on the data in the table and the plots we made above: <br><br>

1.   There are 2 people in the dataset with moderate dementia and they are both male.<br>
2.   Healthy controls (no dementia) have more variability in MMSE score compared to patients with dementia.<br>
3.   Age and dementia status (CDR) are positively correlated.<br>
</div>
'''

display(HTML('<link href="https://nbviewer.org/static/build/styles.css" rel="stylesheet">'))
display(HTML(alert_info))

*Type the number for your answer here*

#### Scatterplots


The nice thing about the seaborn package is that it makes it easy to add or modify a simple scatterplot. This helps to create nice visualizations! <br>
`seaborn` scatterplots: https://seaborn.pydata.org/generated/seaborn.scatterplot.html

We will start by plotting the relationship between estimated total intracranial volume (`eTIV`) and normalized whole-brain volume (`nWBV`). <br>

Again, you only need a few basic things to define your plot: `dataframe`, `x-values`, and `y-values`.

In [None]:
# Let's start simple like we've already been doing a lot in this class
# Plot the relationship between eTIV and nWBV
sns.scatterplot(
    data=dementia_df,
    x="eTIV",
    y="nWBV");


But what if we also want to split the data by sex (`M/F`), like we did above in our barplots? We can easily add that to our plot! <br>

To do this, we use the `hue` option and you can set the color `palette` that you want.

In [None]:
# Now add hue and color palette we want
sns.scatterplot(
    data=dementia_df,
    x="eTIV",
    y="nWBV",
    hue = "M/F",                          # we add a hue to group our data by this column from our data
    palette = "magma");                   # choose colors for our points


We can also add another variable that changes the size of the dots, here based on the atlas scaling factor (`ASF`) from our data.<br>
This is the `size` option. I've also set `sizes` to a range of values to make the differences in size more apparent.

In [None]:
# Add size - review the legend to see how this relates to the values in the column we selected
sns.scatterplot(
    data=dementia_df,
    x="eTIV",
    y="nWBV",
    hue="M/F",
    size = 'ASF',                       # add another variable to group our data by- we adjust size based on value of this column
    sizes=(20, 200),
    palette="magma");

And finally, if we want to add one more option to our plot, we can also split the plot into separate panels according to `CDR`. <br>
Note that we have to switch to a different plotting function. It is similar to `scatterplot`, but it has more options! <br>

This is called `relplot`, short for relational plot. You can plot the relationships between many variables of interest, as shown below.<br> <br>

`seaborn relplot` documentation: https://seaborn.pydata.org/generated/seaborn.relplot.html

In [None]:
sns.relplot(
    data=dementia_df,
    x="eTIV",
    y="nWBV",
    col="CDR",          # split by columns by group
    hue="M/F",          # color points according to the group
    size="ASF",         # change the size of a point according to the value
    sizes=(5, 500),     # scale of the points
    palette = 'magma',  # set your color
    col_wrap=2          # split to two columns
)
plt.show()

In [8]:
#@title Task
from IPython.display import HTML

alert_info = '''
<div style="font-size: 20px;" class="alert alert-info" role="alert">
  <h4 class="alert-heading">Task</h4>
Answer the following question: <br>
Based on the plots above, which is true? <br><br>

1.   There is a strong positive relationship between eTIV and nWBV among different groups <br>
2.   On average, females have greater total intracranial volume <br>
3.  On average, values of the atlas scaling factor (ASF) decrease as the estimated total intracranial volume (eTIV) increases <br>
</div>
'''

display(HTML('<link href="https://nbviewer.org/static/build/styles.css" rel="stylesheet">'))
display(HTML(alert_info))

*Type the number for your answer here*

This demonstrates that there may be multiple ways (`scatterplot` vs. `relplot`) to use functions within packages to produce plots and visualize data!

#### Histograms

The final plot type we will review will be histograms. We've been using these frequently to visualize the results of our simulations during hypothesis testing.

`seaborn` histplot: https://seaborn.pydata.org/generated/seaborn.histplot.html

In [None]:
# Create a simple plot of the count (frequency) of Age of participants in our sample
sns.histplot(
    data=dementia_df,
    x ="Age");

In [None]:
# What if we try to also plot by M/F?
# Notice this is a little harder to tell because the proportions are overlaid
sns.histplot(
    data=dementia_df,
    x ="Age",
    hue = "M/F",                  # try to separate by this grouping variable
    stat = "proportion");         # added another option to demonstrate we can change this to proportion

In [None]:
# Let's try this another way
sns.histplot(
    data=dementia_df,
    x ="Age",
    hue = "M/F",
    multiple = "dodge",             # This option separates out our data better for visualizing
    stat = "proportion");

Above we used the option called `dodge`. This is a common option if you are trying to show your data but don't want overlap. For example, if you show all your participants as dots and some overlap, it may be hard to see how many there are or other important variables you want to demonstrate with your data. You can use `dodge` as a way to shift them slightly so there is less overlap. They will still be plotted at the correct location on the graph for the variables of interest, they are usually just moved in a direction that does not meaningfully change the data.
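
One more note on `stat = "proportion"` when we also set `hue`: as far as I understand seaborn's defaults (`common_norm=True`), each bar is that group's count divided by the total number of observations across all groups, so all of the bars together add up to 1. Here is a minimal `numpy` sketch of that arithmetic. The ages and bin edges are made up for illustration and are not taken from `dementia_df`.

```python
import numpy as np

# Made-up ages for illustration only
ages_f = np.array([62, 67, 71, 74, 78, 83])
ages_m = np.array([65, 72, 77, 88])

# Both groups share the same bin edges
bin_edges = np.array([60, 70, 80, 90])
counts_f, _ = np.histogram(ages_f, bins=bin_edges)
counts_m, _ = np.histogram(ages_m, bins=bin_edges)

# Divide by the total across BOTH groups (assumed common normalization)
total = len(ages_f) + len(ages_m)
prop_f = counts_f / total
prop_m = counts_m / total

print("F proportions:", prop_f)
print("M proportions:", prop_m)
print("Sum of all bars:", prop_f.sum() + prop_m.sum())
```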

For all of the plots above, there are many different parameters you can play around with (color, size, orientation, figure size, axis labels, title, etc). The examples above are basic implementations of these graphs, but we recommend looking through the documentation to see how you can change different aspects of them.

If you were wondering, you can also create all of these types of plots with `pandas`.

`pandas` barplots: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.plot.bar.html

`pandas` histograms: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.hist.html

`pandas` scatterplots: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.scatter.html

---
### Calculate the number of columns
In `datascience`, calculating the number of columns (or rows) only requires accessing an attribute of our Table, either `num_columns` or `num_rows`. With `pandas`, there are two ways to get the number of columns and rows. The `len` function can be used on a part of the DataFrame, or the `shape` attribute can be used. `shape` returns both the number of rows and the number of columns as a tuple, so depending on which one we want, we select it using indexing.

In [None]:
# datascience
dementia_table.num_columns

In [None]:
# pandas
len(dementia_df.columns) # find the length of the list of column names

In [None]:
# pandas
dementia_df.shape # gives both row, column lengths. output is always in the order of (row, column)

In [None]:
# pandas
dementia_df.shape[1] # select the column part of shape

---
### Calculate the number of rows

In [None]:
# datascience
dementia_table.num_rows

In [None]:
# pandas
len(dementia_df) # number of rows in the DataFrame

In [None]:
# pandas
dementia_df.shape[0] # select the row part of shape
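
Since `shape` is just a `(rows, columns)` tuple, we can also unpack both numbers at once instead of indexing. A quick sketch on a small made-up DataFrame (`toy_df` is hypothetical, used so the cell runs on its own):

```python
import pandas as pd

# Small made-up DataFrame: 3 rows, 2 columns
toy_df = pd.DataFrame({"ID": [1, 2, 3], "Age": [70, 80, 75]})

n_rows, n_cols = toy_df.shape   # unpack the (rows, columns) tuple
print("rows:", n_rows, "columns:", n_cols)

# Both approaches give the same answers
print(len(toy_df) == n_rows)
print(len(toy_df.columns) == n_cols)
```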

---
### Apply
In both libraries, we have a function called `apply`, which we use to apply a function to every element of a certain column. <br>Note: in `pandas`, we call `apply` on a Series (essentially a single column of a DataFrame), so we do not pass the column name again.

In [None]:
# Let's write our own function that takes the input of the 'M/F' column and converts it to a number
def convert_str_to_num(x):
    if x == 'M':
        return 1
    elif x == 'F':
        return 0

In [None]:
# datascience- apply the function
dementia_table.apply(convert_str_to_num, "M/F")

In [None]:
# Compare the output to the original
dementia_table.column('M/F')

In [None]:
# pandas
dementia_df["M/F"].apply(convert_str_to_num) # apply our function to the column M/F (the Series is already that column, so no column name is passed)

Above we see that we get the converted values returned to us; the original Table/DataFrame is not changed. If we wanted to change the column in the original Table/DataFrame, we would assign the converted values back to that column so that the change is saved.

For example:

`
dementia_table = dementia_table.with_column("M/F", dementia_table.apply(convert_str_to_num, "M/F"))
`

or

`
dementia_df["M/F"] = dementia_df["M/F"].apply(convert_str_to_num)
`
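
Here is a small self-contained sketch of saving the converted values, using a made-up DataFrame (`toy_df` and the new column names are hypothetical). It stores the result in a new column so the original letters are kept, and it also shows `map` with a dictionary, which is another common way to do this kind of recoding in `pandas`.

```python
import pandas as pd

def convert_str_to_num(x):
    if x == 'M':
        return 1
    elif x == 'F':
        return 0

# Made-up data for illustration only
toy_df = pd.DataFrame({"M/F": ["M", "F", "F", "M"]})

# Save the converted values in a new column (the original column is kept)
toy_df["sex_code"] = toy_df["M/F"].apply(convert_str_to_num)

# Alternative: look each value up in a dictionary with map
toy_df["sex_code_map"] = toy_df["M/F"].map({"M": 1, "F": 0})

print(toy_df)
```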

---
### Joining
Joins are useful when we want to combine two or more tables so we can analyze them together. In `datascience`, we can use `join` and in `pandas` we can use `merge`. For both, we need the two tables and the columns we are joining on.

In [None]:
# datascience
table = Table().with_columns('first', make_array('i', 'c', 'c', 'a'), 'second', make_array('a', 'b', 'b', 'j'), 'third', make_array('c', 'd', 'e', 'f'))
table2 = Table().with_columns( 'another', make_array('i', 'a', 'a', 'a'), 'fourth', make_array('a', 'b', 'b', 'j'), 'fifth', make_array('c', 'd', 'e', 'f'))
print(table)
print()
print(table2)
table.join('first', table2, 'fourth') # join table and table 2 together so columns first and fourth match values

This took my brain a bit to understand, so here is an image that might be helpful. Also, read the text below to check that it matches your understanding of how the tables were joined above.

![](TableJoinExample.png)

In the above code block, we create two different tables. When we call the `join()` function on these two tables, we can think of it as taking the cross product of the rows of both tables (every combination of rows that could happen between both tables) and then filtering it down based on the columns we specify that need to be the same. In this case we have column `first` from the first table and column `fourth` from the second table, so we only keep row combinations where the value in `first` and the value in `fourth` are the same. In our example, the only value that `first` and `fourth` have in common is `a` (we see it in the last row of the `first` column and the first row of the `fourth` column). We then combine the values from these two rows into one row of the new table. Similarly, we can replicate this in `pandas`:

In [None]:
df = pd.DataFrame(data = {'first': ['i', 'c', 'c', 'a'], 'second': ['a', 'b', 'b', 'j'], 'third': ['c', 'd', 'e', 'f']})
df2 = pd.DataFrame(data = {'another': ['i', 'a', 'a', 'a'], 'fourth': ['a', 'b', 'b', 'j'], 'fifth': ['c', 'd', 'e', 'f']})
print(df)
print()
print(df2)
df.merge(df2, left_on = 'first', right_on = 'fourth', how = "inner") # merge df and df2 using an inner join using columns first and fourth
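
To check the "cross product, then filter" description, here is a sketch that does those two steps by hand and compares the result to the inner merge. It assumes a `pandas` version that supports `how="cross"` (1.2 or newer, as far as I know). The names `all_pairs`, `matches`, and `inner` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame(data = {'first': ['i', 'c', 'c', 'a'], 'second': ['a', 'b', 'b', 'j'], 'third': ['c', 'd', 'e', 'f']})
df2 = pd.DataFrame(data = {'another': ['i', 'a', 'a', 'a'], 'fourth': ['a', 'b', 'b', 'j'], 'fifth': ['c', 'd', 'e', 'f']})

# Step 1: every combination of a row from df with a row from df2 (4 x 4 rows)
all_pairs = df.merge(df2, how="cross")

# Step 2: keep only the combinations where 'first' and 'fourth' match
matches = all_pairs[all_pairs["first"] == all_pairs["fourth"]].reset_index(drop=True)

# The inner merge does both steps for us
inner = df.merge(df2, left_on="first", right_on="fourth", how="inner")

print(len(all_pairs), "combinations,", len(matches), "match")
print(matches)
```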

In the `pandas` code above, we have an extra argument that we can specify: `how`. There are several types of joins that we can do with data. In `datascience`, the only option is an inner join, but in `pandas` we can choose between inner, left, right, and outer joins. For more information about these types of joins, check out this page: http://pandas.pydata.org/pandas-docs/version/0.19.1/generated/pandas.DataFrame.merge.html
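
To get a feel for the different join types, here is a minimal sketch on two tiny made-up DataFrames (`left`, `right`, and their columns are hypothetical). The only thing that changes between the merges is `how`, and we look at which `key` values survive each time.

```python
import pandas as pd

# Made-up tables: 'a' is only in left, 'd' is only in right
left = pd.DataFrame({"key": ["a", "b", "c"], "left_value": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_value": [20, 30, 40]})

keys_by_how = {}
for how in ["inner", "left", "right", "outer"]:
    merged = left.merge(right, on="key", how=how)
    keys_by_how[how] = merged["key"].tolist()
    print(how, "->", keys_by_how[how])
```

An inner join keeps only keys found in both tables, a left (or right) join keeps every key from the left (or right) table, and an outer join keeps every key from either table. Values that have no match are filled in with `NaN`.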

Here is another example of join and a schematic that hopefully helps with this concept.
![](TableJoinExample2.png)

In [None]:
table = Table().with_columns('a', make_array(9, 3, 3, 1),
    'b', make_array(1, 2, 2, 10),
    'c', make_array(3, 4, 5, 6))
table

In [None]:
table2 = Table().with_columns('a', make_array(9, 1, 1, 1),
    'd', make_array(1, 2, 2, 10),
    'e', make_array(3, 4, 5, 6))
table2

In [None]:
table.join('a', table2)
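
For comparison, here is a sketch of the same join in `pandas`. Since both tables call the matching column `a`, we can use `on` instead of `left_on`/`right_on`. The names `df_a`, `df_b`, and `joined` are made up for this example, and the row order may differ from the `datascience` result.

```python
import pandas as pd

df_a = pd.DataFrame(data = {'a': [9, 3, 3, 1], 'b': [1, 2, 2, 10], 'c': [3, 4, 5, 6]})
df_b = pd.DataFrame(data = {'a': [9, 1, 1, 1], 'd': [1, 2, 2, 10], 'e': [3, 4, 5, 6]})

# how="inner" is the default, so we can leave it out
joined = df_a.merge(df_b, on='a')
print(joined)
```

The value `1` appears once in `df_a` and three times in `df_b`, so it produces three rows in the result. The value `3` has no match in `df_b`, so those rows are dropped.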

![](https://github.com/tmckim/materials-fa23-colab-working/blob/main/lab/lab08/TableJoinExample2.png?raw=1)

---
### Export to CSV

When we want to convert a Table to a csv file, we first need to convert it to a DataFrame and then to a csv file. In `pandas`, we can get rid of this intermediate step because we already have a DataFrame!

In [None]:
dementia_table.to_df().to_csv('dementia_datascience.csv', index = False)

In [None]:
dementia_df.to_csv('dementia_dataframe.csv', index = False)
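
Why `index = False`? By default, `pandas` also writes the row index (0, 1, 2, ...) as an extra unnamed first column. The sketch below shows the difference on a tiny made-up DataFrame. To avoid creating files, it leaves out the file name, in which case `to_csv` returns the csv text as a string. `toy_df` and the other names are hypothetical.

```python
import io
import pandas as pd

# Made-up data for illustration only
toy_df = pd.DataFrame({"ID": ["A1", "A2"], "Age": [70, 80]})

with_index = toy_df.to_csv()                # no file name -> returns the csv text
without_index = toy_df.to_csv(index=False)

print(with_index.splitlines()[0])      # header row including the unnamed index column
print(without_index.splitlines()[0])   # header row with only our columns

# Reading the text back in gives us our columns again
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip)
```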

---
### Reading `pandas` Documentation
There are many more functions and methods you can call with `pandas` to do more cool things with Series and DataFrames. One way to learn more about this is by looking through the `pandas` documentation. The documentation has all the different functions associated with `pandas`, and descriptions about what they do, how you use them, and some examples.

Some tips for reading through the documentation:
* Many functions have LOTS of different parameters, but usually only a few are important (typically the ones related to your data and to how the function should run). Other parameters are optional and you do not have to specify them (`pandas` will automatically use the default settings). For example, for joins, an inner join is the default setting, but if needed, you can say you want an outer join instead.
* Another parameter you will see is `inplace`. When set to `True`, the function modifies the existing object in place (a new copy with the change applied is NOT returned).
* For quick lookup of a specific function: in the notebook you can hover your cursor over a `pandas` function until the documentation appears. You can then open the documentation as a new tab on the side of the browser window.
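
Here is a short sketch of what `inplace` changes, using `rename` on a made-up DataFrame (`toy_df` and the new column names are hypothetical). Without `inplace`, we get a new DataFrame back and the original stays the same. With `inplace=True`, the original is changed and nothing (`None`) is returned.

```python
import pandas as pd

# Made-up data for illustration only
toy_df = pd.DataFrame({"M/F": ["M", "F"], "Age": [70, 80]})

# Default: returns a new DataFrame, toy_df itself is unchanged
renamed = toy_df.rename(columns={"M/F": "Sex"})
print(list(renamed.columns), list(toy_df.columns))

# inplace=True: toy_df is modified and the call returns None
result = toy_df.rename(columns={"Age": "AgeYears"}, inplace=True)
print(result, list(toy_df.columns))
```

A common mistake is writing `toy_df = toy_df.rename(..., inplace=True)`, which would replace the DataFrame with `None`.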

## Lab Complete

In [None]:
# Run this cell for fun
from IPython.display import HTML
HTML('<img src="https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExdHFvM3pkNWR3azBhZTJ6OHVxcnRmYjYzM3Rqc2w0dWM0aHY1dHp6YSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/CjmvTCZf2U3p09Cn0h/giphy.gif">')


### **Important submission steps:**
1. Choose **Save** (and make sure you've already saved a copy in your drive) from the **File** menu.
2. You will make sure your notebook file is saved in the following steps.
3. You will submit the notebook for this assignment to the corresponding Assignment on the WebCampus (Canvas) course website.

**It is your responsibility to make sure your work is saved before following the instructions in the last cell.**

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output.
**Please save (or check again) before exporting!**
You will save the notebook file (.ipynb):


1.   Go to `"File > Download"` and choose the **.ipynb format** (first option)
  - This will save a copy of the python notebook file- extension .ipynb- in the Downloads folder on your computer (or wherever you have opted to save files)


2. If the above option is not available to you, make sure to save with ctrl + s on a PC or command + s on Apple devices (press both keys at the same time; do not include the + sign). Look at the top of the menu in Google Colab; toward the middle, it might say that changes were saved.
  * If you want to check that things were saved recently, go to your Google drive (via an online browser or from the app) and check the timestamp for when your notebook was last updated. If it wasn't saved recently, go back to the tab where you have your notebook open and resave.
  * The notebook file `"Copy of lab10.ipynb"` will be in your google drive under the `"Colab Notebooks"` folder. (see info at top for more on where things get saved)

## Credits

Data example and some code adapted from [Ruslan Klymentiev](https://pyforneuro.com/).