<a href="https://colab.research.google.com/github/tb-harris/neuroscience-2024/blob/main/prework/4_Pandas_Fundamentals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas

`pandas` is a Python library that helps us work with categorical and numerical data.


To use this library, we first have to **import** it with the keyword `import`. We can follow our import statement with **as** to give `pandas` a shorter name (here, `pd`) that we can reference throughout our code.

**Important: If you don't run the code below, none of the pandas functions used later in this notebook will work!**

In [None]:
import pandas as pd

## Read data
With `pandas` imported, we can read in .csv files with the `pandas` function `read_csv()`.

In that function, we can specify the file we want to use with a URL or with the path to a local file as a string.

This saves the data in a structure called a DataFrame.
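
To see what `read_csv()` does without needing a network connection, here is a minimal sketch. It relies on the fact that `read_csv()` also accepts a file-like object, so a small CSV held in a string can be parsed directly. The CSV text, the column values, and the name `toy_df` are made up for illustration; they are not the data used in the rest of this notebook.

```python
import io
import pandas as pd

# A tiny, made-up CSV held in a string (hypothetical values, for illustration only)
csv_text = """country,year,population
A,2000,100
B,2000,250
A,2001,110
"""

# read_csv() accepts a file-like object as well as a URL or a local path
toy_df = pd.read_csv(io.StringIO(csv_text))

print(type(toy_df))  # the result is a pandas DataFrame
print(toy_df)        # the header row of the CSV becomes the column names
```

The first line of the CSV is treated as the header, and each following line becomes one row of the DataFrame.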

[Click here](https://raw.githubusercontent.com/DeisData/python/master/data/gapminder.csv) to see the csv (comma-separated values) file with the data we will be using in this notebook.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/DeisData/python/master/data/gapminder.csv") # read in data

Our data is now saved as a data frame in Python as the variable `df`. With the data now in the environment, we can take a look at the first few rows with `df.head()`.

In [None]:
df.head()

We can see that this data frame has several different columns, with information about countries and demography.

*If a "View recommended plots" button shows up after running the above code, try using it to see some of the different ways this data can be visualized! This is a new feature of Colab that helps to suggest code to plot data.*

## Summarize data frame

It is important to understand the data we are working with before we begin analysis. First, let's look at the dimensions of the data frame using `df.shape`. It returns the number of rows followed by the number of columns.

In [None]:
df.shape

This shows that our data frame has 14740 rows by 9 columns.

We can also use `df.columns` to display the column names.

In [None]:
df.columns
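
As a small sketch of how these two attributes can be used in code, the example below builds a made-up data frame (`toy_df` and its values are hypothetical) and stores the row and column counts in variables. `shape` is a tuple, so it can be unpacked into two names.

```python
import pandas as pd

# A small, made-up data frame for illustration
toy_df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "year": [2000, 2000, 2001, 2001],
})

# shape is a (rows, columns) tuple, so we can unpack it
n_rows, n_cols = toy_df.shape
print(n_rows, "rows and", n_cols, "columns")

# columns can be converted to a plain Python list of names
column_names = list(toy_df.columns)
print(column_names)
```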

### Categorical variables
Next, let's summarize the categorical, non-numerical variables. For instance, we can identify how many unique regions we have in the data set.

First, to select a column, we use the notation `df['COLUMN_NAME']`.

In [None]:
df['region']

To identify unique entries in this column, we can use the `.unique()` method. We can also use `.nunique()` to find the number of unique values in the column.

In [None]:
df['region'].unique()

In [None]:
df['region'].nunique()
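
The difference between the two methods is easier to see on a tiny example. The sketch below uses a made-up column (the name `toy_df` and the region labels are hypothetical): `.unique()` returns the distinct values themselves, while `.nunique()` returns how many there are.

```python
import pandas as pd

# A small, made-up data frame with repeated categories
toy_df = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
})

# Distinct values, in order of first appearance
unique_regions = toy_df["region"].unique()

# Number of distinct values
n_regions = toy_df["region"].nunique()

print(unique_regions)
print(n_regions)
```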

### Question 1
Write code to determine how many unique countries are in the dataset.

In [None]:
# Your code here

### Numerical variables

Numerical columns can be summarized in several ways. Let's find the mean first.

To make things simpler, we'll just do calculations on the `population`, `life_expectancy`, and `babies_per_woman` columns. We can put those names in a `list` and then specify that list for the columns.

In [None]:
target_columns = [ 'population', 'life_expectancy', 'babies_per_woman' ] # numerical columns

df[target_columns]

With this set of columns, we can run `.mean()` to find the mean of each column.

In [None]:
df[target_columns].mean() # returns the mean of each column

If we want a larger variety of summary statistics, we can use the `.describe()` method.

In [None]:
df[target_columns].describe()
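
To check by hand what `.mean()` and `.describe()` report, it can help to try them on numbers small enough to verify mentally. The sketch below uses a made-up data frame (`toy_df`, its column names, and its values are hypothetical).

```python
import pandas as pd

# A small, made-up data frame: two numerical columns and one text column
toy_df = pd.DataFrame({
    "height": [1.0, 2.0, 3.0, 4.0],
    "weight": [10.0, 20.0, 30.0, 40.0],
    "label": ["a", "b", "c", "d"],
})

numeric_columns = ["height", "weight"]  # select only the numerical columns

means = toy_df[numeric_columns].mean()        # one mean per column
summary = toy_df[numeric_columns].describe()  # count, mean, std, min, quartiles, max

print(means)
print(summary)
```

Each row of the `.describe()` output is one statistic and each column is one of the selected variables, so a single value can be looked up with `summary.loc["mean", "height"]`.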

### Question 2
Print out the summary statistics for columns `age5_surviving`, `gdp_per_day`, and `gdp_per_capita`.

In [None]:
### your code below:


## Manipulate data

### Accessing rows and specific entries

You can also access a specific row using `df.loc[ROW, :]`, where `ROW` is the row's index label. The colon selects all columns for that row. **Try changing the code below to access other rows.**

In [None]:
df.loc[0, :] # the first row

We can use `.loc` to find the value of specific entries, as well.

In [None]:
df.loc[0, 'country'] # first row entry for column
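
Here is a minimal sketch of both uses of `.loc` on a made-up data frame (`toy_df` and its values are hypothetical). It assumes the default index, where rows are labelled 0, 1, 2, and so on.

```python
import pandas as pd

# A small, made-up data frame with the default index (0, 1, 2)
toy_df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [2000, 2001, 2002],
})

# The row labelled 1, with all columns (returned as a Series)
second_row = toy_df.loc[1, :]

# A single entry: the row labelled 1, column 'year'
second_year = toy_df.loc[1, "year"]

print(second_row)
print(second_year)
```

Selecting a whole row gives back a Series whose labels are the column names, so `second_row["country"]` retrieves one value from it.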

### Question 3
Get the row with index 100 from the dataframe.

In [None]:
# Your code below:


### Subset by row

Sometimes, we want to create a subset of the main data frame based on certain conditions. We do this by using `df.loc` and specifying a condition for the rows.

Below, we take all of the rows where `year` is greater than or equal to 2000 with `df['year'] >= 2000` and assign this to a new data frame.

In [None]:
# take all rows where year is greater than or equal to 2000 and create a new dataframe
data_21st_century = df.loc[df['year'] >= 2000, :]
data_21st_century.head() # prints out the first few rows of our new dataframe

We can now analyze this subset of data on its own. For example, we could get the mean life expectancy across all entries in the 21st century.

In [None]:
# Get the mean life expectancy in our 21st century-only dataframe
data_21st_century['life_expectancy'].mean()

We can use the following operators to make subsets:
- Equals: `==`
- Not equals: `!=`
- Greater than, less than: `>`, `<`
- Greater than or equal to: `>=`
- Less than or equal to: `<=`
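
Each of these operators, applied to a column, produces a column of `True`/`False` values with one entry per row, and `.loc` keeps the rows where the value is `True`. The sketch below shows this on a made-up data frame (`toy_df`, `mask`, and the values are hypothetical).

```python
import pandas as pd

# A small, made-up data frame for illustration
toy_df = pd.DataFrame({
    "year": [1999, 2000, 1999, 2001],
    "score": [1.0, 2.0, 3.0, 4.0],
})

# The comparison gives one True/False value per row
mask = toy_df["year"] >= 2000
print(mask)

# .loc keeps only the rows where the mask is True
recent = toy_df.loc[mask, :]
print(recent)

# The same pattern works with the other operators, such as !=
not_1999 = toy_df.loc[toy_df["year"] != 1999, :]
print(not_1999)
```

In this example the two subsets contain the same rows, because every year that is not 1999 is also 2000 or later.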

We can also subset with categorical variables. Here, we take all rows where the country is Hungary.

In [None]:
df_hungary = df.loc[df['country'] == 'Hungary', :] # create a new dataframe with just the rows where 'country' is 'Hungary'
df_hungary.head() # prints out the first few rows of the new Hungary dataframe

We can now analyze just the entries associated with Hungary by, for example, getting summary statistics for the `population` and `life_expectancy` columns.

In [None]:
# Get summary statistics for population and life expectancy in df_hungary
columns = ["population", "life_expectancy"]
df_hungary[columns].describe()

### Question 4

Create a subset of data from Lithuania.

In [None]:
### Your code here


Then, calculate the mean `age5_surviving` of the subset. You should get the value *90.8027901234568*

In [None]:
### Your code here


### Question 5

Create a subset of data from the `year` 2005.

In [None]:
### Your code here


Then, calculate summary statistics for `life_expectancy` and `gdp_per_capita`. The count of both columns in the summary statistics should be *182*.

In [None]:
### Your code here


### Question 6
Follow the steps below to find out which countries have had at least one year with a life expectancy of 80 or above.

Create the subset of data such that `life_expectancy` is 80 or above.

In [None]:
### Your code here


Get a list of unique `country` entries in the dataframe using [.unique()](https://colab.research.google.com/drive/1lN_ZRqzQiv1IjLwIqfcCBz5zO_7zkhox#scrollTo=bGoWQ21hIxiT).

In [None]:
### Your code here

## Resources

This notebook is adapted from the [Brandeis Library Python Programming Workshop](https://deisdata.github.io/python/) created by Ford Fishman.

- [NumPy docs](https://numpy.org/doc/stable/index.html)
- [NumPy getting started](https://numpy.org/doc/stable/user/quickstart.html)
- [Random samples with NumPy](https://numpy.org/doc/stable/reference/random/index.html)
- [Pandas docs](https://pandas.pydata.org/docs/)
- [Pandas getting started](https://pandas.pydata.org/docs/getting_started/index.html#getting-started)
- [Pandas cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [PySpark for big data](https://spark.apache.org/docs/latest/api/python/)

This lesson is adapted from
[Software Carpentry](http://swcarpentry.github.io/python-novice-gapminder/design/).