## Notebook Outcomes

In this notebook we will learn:
<ul>
    <li>what <code>pandas</code>, a popular python data handling package, is</li>
    <li>the basic pandas data structures</li> 
    <li>useful pandas dataframe functionality</li>
    <li>subsetting/searching a pandas dataframe</li> 
    <li>getting pandas to generate descriptive statistics</li> 
    <li>using pandas to read in data</li>
    <li>using pandas to save data</li>
</ul>

# `pandas`

`pandas` is one of the most popular data handling packages in `python`. In this notebook, we'll go over the minimum you'll need to know about the package for the boot camp.

Let's start by importing the package.

In [2]:
# It is standard practice to import
# pandas as pd
import pandas as pd
import numpy as np

### Series and Dataframes

`pandas` has two main data structures: `Series` objects and `DataFrame` objects. Let's explore them below.

In [3]:
# We can turn a list into a series
# with pd.Series()
print([0,1,2,3], type([0,1,2,3]))
print()
print(pd.Series([0,1,2,3]), type(pd.Series([0,1,2,3])))

[0, 1, 2, 3] <class 'list'>

0    0
1    1
2    2
3    3
dtype: int64 <class 'pandas.core.series.Series'>


The second thing we printed was a `Series` object. Note the two columns of numbers. The first column is the index of the object, the second column contains the values of the object. We can access those two separately like below.

In [4]:
# The index
pd.Series([0,1,2,3]).index

RangeIndex(start=0, stop=4, step=1)

In [5]:
# The values
pd.Series([0,1,2,3]).values

array([0, 1, 2, 3], dtype=int64)
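The index does not have to be the default 0, 1, 2, ... range. As a small sketch (the labels and values here are made up for illustration), we can pass our own labels with the `index` argument and then look values up by label:

```python
import pandas as pd

# A Series with string labels instead of the default 0, 1, 2, ... index
# (labels and values are made up for illustration)
s = pd.Series([10, 20, 30], index=["x", "y", "z"])

print(s.index)    # the labels
print(s.values)   # the underlying values as a NumPy array
print(s["y"])     # look up a value by its label
```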

In [6]:
## You practice
# Take the list labeled a and 
# turn it into a Series named b
a = [5,2,3,6,'a','b','e',True,False]







Now let's check out a `DataFrame`.

In [15]:
# We can make a DataFrame using a dictionary
# the dictionary keys are the column labels
# the dictionary values are columns
df = pd.DataFrame({'one':[3,4,5,2,4,5], 
                       'two':['a','b','e','h','l','p']})

# Note that this is not the only way to make 
# a dataframe!
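Two other common ways to build a `DataFrame` are from a list of dictionaries (one per row) and from a 2D `numpy` array. The sketch below uses made-up data and the hypothetical names `df_from_rows` and `df_from_array` so that it does not overwrite `df`:

```python
import numpy as np
import pandas as pd

# From a list of records: one dictionary per row,
# the dictionary keys become the column labels
rows = [{"one": 3, "two": "a"}, {"one": 4, "two": "b"}]
df_from_rows = pd.DataFrame(rows)

# From a 2D NumPy array, supplying the column labels ourselves
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(arr, columns=["one", "two"])

print(df_from_rows)
print(df_from_array)
```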

In [16]:
df

Unnamed: 0,one,two
0,3,a
1,4,b
2,5,e
3,2,h
4,4,l
5,5,p


This is a `DataFrame`. The unlabeled column is the index, and each labeled column is itself a `Series` object. We can access rows, single entries, and whole columns in the following ways.

In [22]:
print(df[4:6])

   one two
4    4   l
5    5   p


In [25]:
df.at[1, 'two']

'b'

In [26]:
df.iat[1, 1]

'b'
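`.at` and `.iat` return the same entry here only because the default index labels happen to match the row positions. `.at` looks up by label, while `.iat` looks up by integer position. A minimal sketch with a made-up index starting at 10 (the name `df_demo` is hypothetical) makes the difference visible:

```python
import pandas as pd

# Same kind of data as above, but with a made-up index starting at 10
# so that row labels and row positions no longer coincide
df_demo = pd.DataFrame(
    {"one": [3, 4, 5], "two": ["a", "b", "e"]},
    index=[10, 11, 12],
)

print(df_demo.at[11, "two"])  # label-based: row labeled 11, column 'two'
print(df_demo.iat[1, 1])      # position-based: second row, second column
```

Both lines print the same entry, but `df_demo.at[1, "two"]` would raise a `KeyError` because no row is labeled 1.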

In [9]:
# df.column_name
print(df.one) 
print()
print(type(df.one))

0    3
1    4
2    5
3    2
4    4
5    5
Name: one, dtype: int64

<class 'pandas.core.series.Series'>


In [10]:
# or df['column_name']
print(df['two']) 
print()
print(type(df['two']))

0    a
1    b
2    e
3    h
4    l
5    p
Name: two, dtype: object

<class 'pandas.core.series.Series'>


In [11]:
# Just like with series we can use .index
df.index

RangeIndex(start=0, stop=6, step=1)

In [None]:
## Practice
# Make a data frame 
# Make the first column labeled 'first' from a
# Make the second column labeled 'second' from b
# see what happens when you add , index=range(10,10+len(a)) 
# after the dictionary
a = [4,5,3,4,5,6,0]
b = ['a','c','d','g','l','m','p']







### Helpful `DataFrame` Functions

`pandas` offers some really nice built-in functions to help you explore any data set you're dealing with. Let's explore them below.

In [27]:
# We'll work with the following dataframe
df = pd.read_csv("JR_Smith_Shots_2015_16.csv")

In [28]:
# We can examine the top of the dataframe
# the default is the first 5 entries
df.head()

Unnamed: 0,LOC_X,LOC_Y,SHOT_MADE_FLAG
0,-106,244,0
1,-96,97,0
2,30,23,0
3,-204,-1,0
4,-76,237,0


In [29]:
# We can also look at the bottom
# passing in a number lets us control the number of rows
df.tail(10)

Unnamed: 0,LOC_X,LOC_Y,SHOT_MADE_FLAG
839,-240,-16,0
840,-147,213,0
841,-81,242,1
842,171,178,1
843,46,252,0
844,145,-15,0
845,-241,67,0
846,164,195,0
847,0,1,1
848,-89,288,0


In [30]:
# We can get a random sample
df.sample(20)

Unnamed: 0,LOC_X,LOC_Y,SHOT_MADE_FLAG
388,-233,-21,0
574,150,195,0
748,-232,-11,0
311,151,198,1
768,225,105,0
614,-14,252,0
809,-153,85,0
96,32,10,0
448,82,254,0
281,-65,252,1


In [31]:
## We can sort our dataframe by a single column
df.sort_values('LOC_X')

Unnamed: 0,LOC_X,LOC_Y,SHOT_MADE_FLAG
81,-250,67,1
278,-248,46,1
813,-248,3,1
826,-248,18,0
305,-246,2,0
...,...,...,...
365,240,75,0
836,241,-16,1
191,241,8,0
369,241,90,0


In [32]:
# or by multiple columns
# and choose ascending or descending order for each one
df.sort_values(['LOC_X','LOC_Y'],ascending=[False,True])

Unnamed: 0,LOC_X,LOC_Y,SHOT_MADE_FLAG
535,245,-5,1
836,241,-16,1
191,241,8,0
369,241,90,0
745,240,-5,1
...,...,...,...
280,-246,101,1
813,-248,3,1
826,-248,18,0
278,-248,46,1


In [33]:
# We can drop certain rows by index
df.drop([0,1,2,3]).head()

Unnamed: 0,LOC_X,LOC_Y,SHOT_MADE_FLAG
4,-76,237,0
5,25,23,1
6,43,47,1
7,48,100,0
8,22,16,1


In [None]:
# if you have missing data you can drop it too
df.dropna()
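The shot data above has no missing entries, so here is a small sketch on a made-up frame (the name `df_missing` is hypothetical) showing how to count missing values and what `dropna` removes. Note that `dropna` returns a new `DataFrame` and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

# A tiny made-up frame with one missing value in each column
df_missing = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", "b", None]})

print(df_missing.isna().sum())          # count missing values per column
print(df_missing.dropna())              # drop rows with any missing value
print(df_missing.dropna(subset=["x"]))  # only require 'x' to be present
```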

In [34]:
## Practice
## run this code
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
data = np.concatenate([iris.data,iris.target.reshape(-1,1)],axis=1)
column_names = [name[:-5].split(" ")[0] + "_" + name[:-5].split(" ")[1] for name in iris.feature_names]

column_names.append('class')

iris = pd.DataFrame(data,
                    columns = column_names)

In [35]:
## explore the iris dataframe



iris



Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,0.0
1,4.9,3.0,1.4,0.2,0.0
2,4.7,3.2,1.3,0.2,0.0
3,4.6,3.1,1.5,0.2,0.0
4,5.0,3.6,1.4,0.2,0.0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2.0
146,6.3,2.5,5.0,1.9,2.0
147,6.5,3.0,5.2,2.0,2.0
148,6.2,3.4,5.4,2.3,2.0


### Getting Descriptive Statistics

`pandas` also has built-in functions for calculating common descriptive statistics.

In [36]:
# find the max for each column
df.max()

LOC_X             245
LOC_Y             693
SHOT_MADE_FLAG      1
dtype: int64

In [37]:
# find the mean
df.mean()

LOC_X              -6.204947
LOC_Y             112.106007
SHOT_MADE_FLAG      0.414605
dtype: float64

In [38]:
# Get a list of summary stats
df.describe()

Unnamed: 0,LOC_X,LOC_Y,SHOT_MADE_FLAG
count,849.0,849.0,849.0
mean,-6.204947,112.106007,0.414605
std,157.201943,91.254429,0.492944
min,-250.0,-46.0,0.0
25%,-158.0,21.0,0.0
50%,-2.0,110.0,0.0
75%,133.0,193.0,1.0
max,245.0,693.0,1.0


In [40]:
# You can get a count of how many of each 
# value exist in a column
df.SHOT_MADE_FLAG.value_counts()

0    497
1    352
Name: SHOT_MADE_FLAG, dtype: int64
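If proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. The sketch below uses a short made-up 0/1 series (the name `flags` is hypothetical) and also shows that the mean of a 0/1 column equals the share of ones:

```python
import pandas as pd

# Made-up 0/1 values standing in for a flag column
flags = pd.Series([0, 1, 0, 0, 1])

counts = flags.value_counts()                # how many of each value
shares = flags.value_counts(normalize=True)  # fraction of each value

print(counts)
print(shares)
print(flags.mean())  # for 0/1 data, the mean is the share of ones
```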

In [44]:
## Practice
## What are the min and max of petal_width 
## from the iris dataframe?


iris.describe()



Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
count,150.0,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333,1.0
std,0.828066,0.435866,1.765298,0.762238,0.819232
min,4.3,2.0,1.0,0.1,0.0
25%,5.1,2.8,1.6,0.3,0.0
50%,5.8,3.0,4.35,1.3,1.0
75%,6.4,3.3,5.1,1.8,2.0
max,7.9,4.4,6.9,2.5,2.0


In [None]:
## Practice
## Provide summary statistics for all iris columns





### Subsetting and Searching a `DataFrame`

Sometimes we'll want to get a subset of a `DataFrame` or search for observations that fit a certain condition. There are a few ways we can do that.

In [46]:
print(len(df))

849


In [47]:
# .loc for logical subsetting
# first enter the boolean condition you're interested in
# then if you want certain columns you can enter that after 
# the comma
df.loc[df.LOC_Y > 25,]

Unnamed: 0,LOC_X,LOC_Y,SHOT_MADE_FLAG
0,-106,244,0
1,-96,97,0
4,-76,237,0
6,43,47,1
7,48,100,0
...,...,...,...
842,171,178,1
843,46,252,0
845,-241,67,0
846,164,195,0


In [48]:
# multiple conditions
df.loc[(df.LOC_Y > 20) & (df.LOC_X >40),['LOC_X','LOC_Y']]

Unnamed: 0,LOC_X,LOC_Y
6,43,47
7,48,100
10,163,141
12,125,160
14,143,203
...,...,...
837,125,144
838,112,232
842,171,178
843,46,252
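Each condition is a boolean `Series`, and conditions are combined elementwise with `&` (and), `|` (or), and `~` (not). The parentheses around each comparison are needed because these operators bind more tightly than `>` or `<` in `python`. A minimal sketch on made-up data (the name `df_demo` is hypothetical):

```python
import pandas as pd

# Made-up coordinates reusing the column names from the shot data
df_demo = pd.DataFrame({"LOC_X": [-50, 45, 100, 10], "LOC_Y": [30, 10, 50, 5]})

mask_and = (df_demo.LOC_Y > 20) & (df_demo.LOC_X > 40)  # both conditions hold
mask_or = (df_demo.LOC_Y > 20) | (df_demo.LOC_X > 40)   # at least one holds
mask_not = ~(df_demo.LOC_Y > 20)                        # condition does not hold

print(df_demo.loc[mask_and])
print(df_demo.loc[mask_or])
print(df_demo.loc[mask_not])
```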


In [55]:
iris2 = iris.loc[iris['class'] > 1, ]

In [None]:
# Subset with a numeric index
# use iloc, first rows then columns
# gives rows 14 through 22 (the end of the slice is excluded)
# of the column at position 1 (LOC_Y)
df.iloc[14:23,1]
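Slices passed to `.iloc` follow the usual `python` convention and exclude the end position, while label slices passed to `.loc` include the end label. A small sketch on made-up data (the names `df_demo`, `by_position`, and `by_label` are hypothetical):

```python
import pandas as pd

# Made-up frame with the default 0, 1, 2, ... index
df_demo = pd.DataFrame({"a": range(10), "b": range(10, 20)})

by_position = df_demo.iloc[2:5, 1]  # positions 2, 3, 4 of the second column
by_label = df_demo.loc[2:5, "b"]    # labels 2, 3, 4, 5 of column 'b'

print(len(by_position))  # 3 rows
print(len(by_label))     # 4 rows
```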

In [None]:
# We can even groupby for categorical variables 
# to make calculating summary stats easier
df.groupby('SHOT_MADE_FLAG').mean()
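`groupby` splits the rows by the values of one column, and the function that follows is applied to each group separately. As a sketch on made-up data (the name `df_demo` is hypothetical), we can pick a single column after grouping and ask for several statistics at once with `agg`:

```python
import pandas as pd

# Made-up shots: a 0/1 flag and a distance-like column
df_demo = pd.DataFrame({
    "SHOT_MADE_FLAG": [0, 1, 0, 1, 1],
    "LOC_Y": [10, 20, 30, 40, 60],
})

# Mean of LOC_Y within each value of SHOT_MADE_FLAG
means = df_demo.groupby("SHOT_MADE_FLAG")["LOC_Y"].mean()

# Several statistics per group in one call
summary = df_demo.groupby("SHOT_MADE_FLAG")["LOC_Y"].agg(["min", "max", "count"])

print(means)
print(summary)
```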

In [None]:
## Practice
## What is the maximum sepal_length by class?
## Which iris observation has the minimal petal_width?









### Reading and Writing csv Files

There are many different file types that contain data, but one of the most basic is the `comma separated value` or `csv` file.

We'll finish the notebook by learning how to read in data from a csv file and how to write our data to a csv file.

In [41]:
# In this folder is a file labeled 
# "JR_Smith_Shots_2015_16.csv"
# reading it in is simple: just use pd.read_csv(file_name)
df = pd.read_csv("JR_Smith_Shots_2015_16.csv")

In [42]:
df.head()

Unnamed: 0,LOC_X,LOC_Y,SHOT_MADE_FLAG
0,-106,244,0
1,-96,97,0
2,30,23,0
3,-204,-1,0
4,-76,237,0


In [None]:
## Practice
## read in the file beers.csv







In [None]:
# Let's make a new dataframe
df = pd.DataFrame({'one':[1,2,4,5,6,3],'two':[4,3,2,6,7,3]})

In [None]:
# It can be written to a csv file with
# .to_csv(file_name)
df.to_csv("test.csv")

In [None]:
## Practice
## read in test.csv here
## then look at the head







In [None]:
# We can avoid writing the index to file like so
df.to_csv("test.csv",index=False)
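To see what the `index` argument changes without touching any files, the sketch below writes a made-up frame to csv text in memory and reads it back. It relies on `to_csv` returning a string when no file name is given; the names `df_demo`, `with_index`, and `without_index` are hypothetical.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"one": [1, 2], "two": [4, 3]})

with_index = df_demo.to_csv()                # no file name: returns the csv text
without_index = df_demo.to_csv(index=False)

# The saved index comes back as an extra column named 'Unnamed: 0'
print(pd.read_csv(io.StringIO(with_index)).columns.tolist())

# No index was written, so only the original columns come back
print(pd.read_csv(io.StringIO(without_index)).columns.tolist())

# index_col=0 tells read_csv to use the first column as the index instead
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(restored.columns.tolist())
```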

## Task: the DNA Par file

In [None]:
## Practice
## read in test_dnapar.par and turn it into a dataframe here
## then look at the head







In [None]:
## What is the average 'Roll' and the average 'Twist'


## What is the max rise value?




In [None]:
## Make a new column, Bend, which is the square root of tilt-squared + roll-squared



## Task: the DNA PDB file



In [None]:
## Load the test_dnapdb.pdb file 
## Remove any line that starts with CONECT
## Load the remaining file into a dataframe
## Study this url (https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/introduction)
## Determine what the column headers should be for a pdb file and then add that to the dataframe






In [None]:
## What is the average x coordinate for P atoms in Chain A?





In [None]:
## How many atoms are in Chain B?





### The End

That's it for this notebook! Now try and complete the pandas - Skill Test Notebook.

You should now know enough `pandas` to get started with data handling in `python`. If you want to learn more, check out the documentation, <a href = "https://pandas.pydata.org/docs">https://pandas.pydata.org/docs</a>, or search the web if you have a specific question.