# Intro to Pandas

Today, you will learn about using the Python library Pandas for working with large datasets.

# Table of contents
[1. What is Pandas?](#1.-What-is-Pandas?)

[2. Pandas Functions](#2.-Pandas-Functions)
- [2.1  Initializing a DataFrame](#2.1-Initializing-a-DataFrame)
    - pd.DataFrame
- [2.2 Relabeling columns and rows](#2.2-Relabeling-columns-and-rows)
    - pd.DataFrame.index
    - pd.DataFrame.columns
- [2.3 Selecting a particular entry in a DataFrame](#2.3-Selecting-a-particular-entry-in-a-DataFrame)
    - pd.DataFrame.loc
    - pd.DataFrame.iloc
- [2.4 Selecting a range of entries in a DataFrame](#2.4-Selecting-a-range-of-entries-in-a-DataFrame)
- [2.5 Selecting entire columns or rows in a DataFrame](#2.5-Selecting-entire-columns-or-rows-in-a-DataFrame)
- [2.6 Checking the number of rows and columns in a DataFrame](#2.6-Checking-the-number-of-rows-and-columns-in-a-DataFrame)
    - pd.DataFrame.shape
- [2.7 Creating a Pivot DataFrame](#2.7-Creating-a-Pivot-DataFrame)
    - pd.DataFrame.describe()
    - pd.DataFrame.min()
    - pd.DataFrame.max()
    - pd.DataFrame.mean()
    - pd.DataFrame.median()
    - pd.DataFrame.std()
    - pd.DataFrame.quantile()
    
    
[3. Reading CSV files into a Pandas DataFrame](#3.-Reading-CSV-files-into-a-Pandas-DataFrame)

sources:

https://pandas.pydata.org/pandas-docs/stable/10min.html

https://cloudxlab.com/blog/numpy-pandas-introduction/

https://docs.scipy.org/doc/numpy/user/quickstart.html

Python libraries are sets of functions written by other people and open sourced for general use. We do not need to worry about the code inside these functions; we just need to know how to use them. Libraries use library specifications (i.e., the *library spec*) to describe what functions do and what arguments the functions take.

Today, we will learn about **Pandas** - a library which allows us to read in data in a spreadsheet format and manipulate it easily.

# 1. What is Pandas?
Pandas is one of the most widely used Python libraries in data science. It provides high-performance, easy-to-use data structures and data analysis tools. The main object in Pandas is called a DataFrame. A DataFrame is like a spreadsheet with column names and row labels.

However, a DataFrame is better than a spreadsheet because there are many functions that can be called to act on a DataFrame and provide many additional functionalities. Examples include creating pivot tables (a table that summarizes the data from another table), creating new columns based on other columns, calculating statistics on the values in the table, and plotting graphs. 

Pandas is usually imported into Python under the *pd* nickname:

In [None]:
import pandas as pd

The nickname pd allows us to call Pandas functions using the notation ***pd.FunctionName***

Pandas is similar to Excel and displays two-dimensional data as a neatly formatted table.

We will go through all the most important functionalities of Pandas today. If you want to learn more, the best way to learn a new library is to go through tutorials offered online, such as these:

https://pandas.pydata.org/pandas-docs/stable/10min.html

https://pandas.pydata.org/pandas-docs/stable/tutorials.html

You can also find the names of the functions available in Pandas in the online Pandas API reference:

https://pandas.pydata.org/pandas-docs/stable/api.html

Additionally, [Stack Overflow](https://www.stackoverflow.com) is a fantastic resource with answers to many common questions about Python and its libraries. Searching Google for a Python question usually yields links to Stack Overflow.

# 2. Pandas Functions
Today, we will learn the basic ways to manipulate Pandas DataFrames and use them to look at the Fragile Families data. 

# 2.1 Initializing a DataFrame
To create a DataFrame, we call the Pandas DataFrame function and pass it a list. In a notebook, a DataFrame is displayed as a formatted table when its name is the last line of a cell:

In [None]:
s = pd.DataFrame([1,3,5,6,8])
s

The DataFrame above has one column labelled "0" (this is the default name) with values 1,3,5,6,8. The left column in bold is the index column with the row labels (the labels start at "0" by default). 

## Exercise 1.1
Generate your own DataFrame from the list of all odd numbers between 1 and 9.
## Answer:

We can also create a DataFrame with multiple columns by passing a list of lists to the DataFrame function.

Let's create a 6x4 (6 rows, 4 columns) Python list: the list will have 6 elements, each of which is a list with 4 elements. This is a "list of lists", or a "2D list".

In [None]:
a = [[2,3,4,5],[4,677,774,3],[402,3034,202,22],[3.4,67.8,3,8],[5,4,22,5.],[1,2,3,4]]
print("list:\n",a)

Now, let's create a DataFrame from that list:


In [None]:
df = pd.DataFrame(a)
df

You can see that the DataFrame contains the same information, but in a much nicer, cleaner format.

By default, row and column labels both start at 0. Since we have 6 rows, the row labels go from 0 to 5, and since we have 4 columns, the column labels go from 0 to 3.
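
The default labels described above can be checked directly. The following is a minimal sketch that rebuilds the same list of lists; the name `df_default` is only used here for illustration and does not appear elsewhere in this notebook.

```python
import pandas as pd

# Same 6x4 list of lists as in the text above
a = [[2, 3, 4, 5], [4, 677, 774, 3], [402, 3034, 202, 22],
     [3.4, 67.8, 3, 8], [5, 4, 22, 5.], [1, 2, 3, 4]]

# df_default is a hypothetical name; no row or column labels are supplied
df_default = pd.DataFrame(a)

row_labels = list(df_default.index)       # default row labels
column_labels = list(df_default.columns)  # default column labels

print(row_labels)     # [0, 1, 2, 3, 4, 5]
print(column_labels)  # [0, 1, 2, 3]
```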

# 2.2 Relabeling columns and rows
We can also label rows and columns with distinct names, which helps clarify what the different rows and columns mean. To name rows we assign to DataFrame.index, and to name columns we assign to DataFrame.columns.

We'll use a Python built-in function "range". https://docs.python.org/3/library/functions.html#func-range

The range(a,b) function generates all integers from a to b-1. In Python 3 it returns a range object rather than a list, so we wrap it in list() to see its values.

In [None]:
new_idx = range(501,507) 
list(new_idx)

Let's set the new range to be the new index of our DataFrame.

In [None]:
df.index = new_idx
df

In [None]:
df.columns = ['A','B','C','D']
df

We could also do it all at once - generate a new DataFrame from a list of lists, with custom column and row names assigned:

In [None]:
df1 = pd.DataFrame(a, index=range(501,507), columns=['A','B','C','D'])
df1

Now that we know how to create DataFrames, let's see some of the things we can do with them.

# 2.3 Selecting a particular entry in a DataFrame

To select a particular entry of the DataFrame, we can use loc, which takes the row (index) label and the column label inside square brackets.

In [None]:
print(df.loc[502,'A'])
print(df.loc[506,'B'])

Alternatively, we can use iloc, which takes the row position and the column position, both counted from 0. So to get the same values as above, we first pick the second row (position 1) and the first column (position 0):

In [None]:
print(df.iloc[1,0])

And for the second value we need to pick the sixth row (position 5) and the second column (position 1):

In [None]:
print(df.iloc[5,1])
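
To make the relationship between labels and positions concrete, here is a small self-contained sketch. It rebuilds the same table under the hypothetical name `df_demo` and uses `Index.get_loc` to translate a label into its position; that call is an addition for illustration and is not used elsewhere in this notebook.

```python
import pandas as pd

# Same data and labels as in the text above
a = [[2, 3, 4, 5], [4, 677, 774, 3], [402, 3034, 202, 22],
     [3.4, 67.8, 3, 8], [5, 4, 22, 5.], [1, 2, 3, 4]]
df_demo = pd.DataFrame(a, index=range(501, 507), columns=['A', 'B', 'C', 'D'])

by_label = df_demo.loc[502, 'A']   # row labelled 502, column labelled 'A'
by_position = df_demo.iloc[1, 0]   # second row, first column

# Translate labels into 0-based positions
row_pos = df_demo.index.get_loc(502)
col_pos = df_demo.columns.get_loc('A')

print(by_label == by_position)  # True
print(row_pos, col_pos)         # 1 0
```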

## Exercise 2.3.1
Select the value 3034.0 from df and save it to a variable.

# 2.4 Selecting a range of entries in a DataFrame
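To select values from a to b in a particular row or column we can use the colon (slice) operator. With iloc, the slice a:b stops just before position b, so we write "a:b+1" to include position b. With loc, the slice a:b includes both labels a and b.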

So if we want to get the first, second and third rows in column 'A' (the first column) we can use iloc like this:

In [None]:
df.iloc[0:3,0]

## Question 2.4.1: 
Why do we get a different answer (an empty Series) if we use loc instead of iloc?

In [None]:
#print(df.loc[0:2,'A'])

## Answer:

The correct option in this case would be 

In [None]:
df.loc[501:503,"A"]

Alternatively, we can pass a list into iloc or loc specifying what values we are interested in

In [None]:
df.iloc[[0,1,2],0]

Reminder: loc takes the row and column labels, while iloc takes the row and column positions, counted from 0.

In [None]:
df.loc[[501,502,503],'A']
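
The difference between the two kinds of slices can be sketched as follows. The example rebuilds the same table under the hypothetical name `df_demo` so that it runs on its own.

```python
import pandas as pd

# Same data and labels as in the text above
a = [[2, 3, 4, 5], [4, 677, 774, 3], [402, 3034, 202, 22],
     [3.4, 67.8, 3, 8], [5, 4, 22, 5.], [1, 2, 3, 4]]
df_demo = pd.DataFrame(a, index=range(501, 507), columns=['A', 'B', 'C', 'D'])

# Position-based slice: positions 0, 1, 2 (the stop position 3 is excluded)
first_three_by_position = df_demo.iloc[0:3, 0]

# Label-based slice: labels 501, 502, 503 (the stop label 503 is included)
first_three_by_label = df_demo.loc[501:503, 'A']

print(list(first_three_by_label.index))                      # [501, 502, 503]
print(first_three_by_position.equals(first_three_by_label))  # True
```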

# 2.5 Selecting entire columns or rows in a DataFrame
To select an entire column, we can use square brackets:

In [None]:
df["C"]

Or, we can use a period like this:

In [None]:
df.C
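
One caveat worth knowing: the period notation only works when the column name is a valid Python name that does not clash with an existing DataFrame attribute. The sketch below uses a hypothetical DataFrame, `df_names`, with made-up column names chosen to show the problem. Square brackets work in both cases.

```python
import pandas as pd

# Hypothetical DataFrame whose column names are awkward for dot access
df_names = pd.DataFrame({"count": [1, 2, 3], "my col": [4, 5, 6]})

# Square brackets always select the column
print(df_names["count"].tolist())   # [1, 2, 3]
print(df_names["my col"].tolist())  # [4, 5, 6]

# df_names.count is the DataFrame.count method, not the "count" column
print(callable(df_names.count))     # True
```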

We can also select multiple columns by putting the column labels inside a list (note that the order of the columns matches the order of the list entries):

In [None]:
df[["C","A"]]

We can also use loc or iloc with a colon ":", which means "all entries".

So, for example "df.loc[:,'C']" says "select all rows in column C"

In [None]:
df.loc[:,'C']

and "df.iloc[:,2]" says "select all rows in the third column"

In [None]:
df.iloc[:,2]

To select an entire row, we can do the same thing, but this time we pass the colon as the second argument. So now "df.loc[502,:]" says "select all columns in the row labelled 502" (that is, the second row):

In [None]:
df.loc[502,:]

and "df.iloc[1,:]" says "select all columns in the second row":

In [None]:
df.iloc[1,:]
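
A selected row comes back as a Pandas Series whose labels are the column names. The following sketch, which rebuilds the same table under the hypothetical name `df_demo`, checks that the label-based and position-based selections return the same row.

```python
import pandas as pd

# Same data and labels as in the text above
a = [[2, 3, 4, 5], [4, 677, 774, 3], [402, 3034, 202, 22],
     [3.4, 67.8, 3, 8], [5, 4, 22, 5.], [1, 2, 3, 4]]
df_demo = pd.DataFrame(a, index=range(501, 507), columns=['A', 'B', 'C', 'D'])

row_by_label = df_demo.loc[502, :]     # row labelled 502
row_by_position = df_demo.iloc[1, :]   # second row

print(type(row_by_label).__name__)           # Series
print(list(row_by_label.index))              # ['A', 'B', 'C', 'D']
print(row_by_label.equals(row_by_position))  # True
```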

## Exercise 2.5.1
Select columns A and D and rows 504 and 505 using loc

## Exercise 2.5.2
Select columns A and D and rows 504 and 505 using iloc

# 2.6 Checking the number of rows and columns in a DataFrame
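We can check the number of rows and columns in the DataFrame with the shape attribute, which returns a (rows, columns) tuple. Note that shape is an attribute, so it is written without parentheses: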

In [None]:
df.shape
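
Because shape is a plain tuple, it can be unpacked into two variables. This is a minimal sketch using the same table, rebuilt under the hypothetical name `df_demo`.

```python
import pandas as pd

# Same data and labels as in the text above
a = [[2, 3, 4, 5], [4, 677, 774, 3], [402, 3034, 202, 22],
     [3.4, 67.8, 3, 8], [5, 4, 22, 5.], [1, 2, 3, 4]]
df_demo = pd.DataFrame(a, index=range(501, 507), columns=['A', 'B', 'C', 'D'])

# Unpack the (rows, columns) tuple
n_rows, n_cols = df_demo.shape

print(n_rows, n_cols)          # 6 4
print(len(df_demo) == n_rows)  # True: len() of a DataFrame is its row count
```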

# 2.7 Creating a Pivot DataFrame

Here, a pivot DataFrame means a table of summary statistics (count, mean, standard deviation, and so on) for each column. First, recall what our DataFrame looks like:

In [None]:
df

To create the pivot DataFrame, we call the describe method like this:

In [None]:
df.describe()

We could have calculated these values ourselves. For example, for column A:

In [None]:
print(df['A'].count())
print(df['A'].mean())
print(df['A'].std())
print(df['A'].min())
print(df['A'].quantile(0.25))
print(df['A'].quantile(0.5))
print(df['A'].quantile(0.75))
print(df['A'].max())

But using the function describe is certainly easier!
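
The result of describe is itself a DataFrame, so individual statistics can be picked out with loc. The sketch below rebuilds the same table under the hypothetical name `df_demo` and checks two entries against the values computed by hand.

```python
import math
import pandas as pd

# Same data and labels as in the text above
a = [[2, 3, 4, 5], [4, 677, 774, 3], [402, 3034, 202, 22],
     [3.4, 67.8, 3, 8], [5, 4, 22, 5.], [1, 2, 3, 4]]
df_demo = pd.DataFrame(a, index=range(501, 507), columns=['A', 'B', 'C', 'D'])

summary = df_demo.describe()

# The row labels of the summary are the names of the statistics
stat_names = list(summary.index)
print(stat_names)

# Compare entries of the summary with the values computed directly
mean_matches = math.isclose(summary.loc['mean', 'A'], df_demo['A'].mean())
median_matches = math.isclose(summary.loc['50%', 'A'], df_demo['A'].median())
print(mean_matches, median_matches)  # True True
```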

# 3. Reading CSV files into a Pandas DataFrame

Another way to create a DataFrame is from a CSV file. CSV (comma-separated values) files are a very common format for saving tabular (Excel-like) data. To create a DataFrame from a CSV file, we use the Pandas read_csv function.

Reminder: 
* "pwd" means "display the current directory"
* ".." means "go up a directory"

In [None]:
pwd

In [None]:
background = "../ff_data/background.csv"
data_frame = pd.read_csv(background, low_memory=False)

Now that we have created a DataFrame from the CSV file, we can see how many rows and columns it has using the shape attribute, as we did before:

In [None]:
data_frame.shape

Our DataFrame has 4,242 rows where each row corresponds to a different family that took part in the study and 12,943 columns where each column represents one piece of information (a "variable" or a "feature") of the family.

Because this dataset is huge, it doesn't make sense to try to look at it all at once. Instead, we can take a quick peek at the DataFrame using the "data_frame.head()" method, which returns the first five rows. For a table this wide, the notebook displays only the first and last few columns:

In [None]:
data_frame.head()
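
The read_csv and head calls above can also be tried without the study file. The sketch below is an illustration only: it reads a tiny made-up CSV from an in-memory string, and the names `csv_text`, `toy_frame`, `var_a`, and `var_b` as well as all values are hypothetical. Passing a number to head controls how many rows are returned.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file; all values are made up for illustration
csv_text = "challengeID,var_a,var_b\n1,10,0.5\n2,20,1.5\n3,30,2.5\n"

# read_csv accepts a file path or a file-like object such as StringIO
toy_frame = pd.read_csv(io.StringIO(csv_text))

print(toy_frame.shape)   # (3, 3)

# head(2) returns only the first two rows
first_two = toy_frame.head(2)
print(first_two)
```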

## Exercise 3.1
Select column m1intmon for rows with challengeID ranging from 1000 to 1010

## Exercise 3.2
Find out the name of the variables associated with the mother's and father's age when the child was born, then select these two columns from data_frame for families with challengeID from 30 to 45.

## Exercise 3.3
What is the mean, standard deviation, min, and max of each column in the previous exercise?

## Exercise 3.4
What is the mean, standard deviation, min, and max of each row? Hint: use Google to find out how to do this!

## Exercise 3.5
Find out the name of the variables associated with the mother's and father's level of education when the child was born, then select these two columns from data_frame for families with challengeID from 980 to 1000.