# Pandas Notes 
Most of these notes came from, or were inspired by, the DTSC520 class I took at Eastern University.
<ul>
    <li>https://wesmckinney.com/book/ </li>
<li>https://jakevdp.github.io/PythonDataScienceHandbook/  I have the third edition</li>
<li>https://pandas.pydata.org/docs/user_guide/10min.html</li>
<li>https://pandas.pydata.org/docs/user_guide/cookbook.html#cookbook</li>
</ul>

## Pandas Data Structures
<b>Series</b> - one-dimensional labeled array. It can hold mixed data types, in which case its dtype is object<br>
<b>DataFrame</b> - two-dimensional labeled structure with rows and columns; each column can have its own dtype
## Object Creation
<b>Series</b> - a sequence of values with an explicitly defined sequence of indices. We can access the values of the Series using the .values and .array attributes.<br>
If no index is specified, a Series gets a default integer index (0, 1, 2, ...).<br>
<b>pd.Series(data, index=index)</b><br>
pd.Series has an explicitly defined index. np.array has an implicitly defined integer index.

In [75]:
import numpy as np
import pandas as pd #always need to do your imports
# define a series with a sequence
s = pd.Series([1,3,5, np.nan, 6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
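
To make the explicit index concrete, here is a minimal sketch that builds the same values with a label index and reads them back through the attributes mentioned above. The labels 'a' to 'f' and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same values as the Series above, but with an explicit label index
# (the labels 'a'..'f' are hypothetical, chosen only for this example).
s_labeled = pd.Series([1, 3, 5, np.nan, 6, 8], index=list("abcdef"))

values = s_labeled.values         # underlying NumPy array
labels = s_labeled.index          # the explicit index object
third = s_labeled["c"]            # look up by label
third_by_pos = s_labeled.iloc[2]  # look up by integer position

print(type(values).__name__)
print(list(labels))
print(third, third_by_pos)
```

Both lookups reach the same element; the label index is simply an extra way to address it on top of the positional order.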

A Series is very similar to a dictionary that can also be manipulated with array operations such as slicing.
Here we create a Series from a dictionary.

In [76]:
fruit_inv_dict = {
    'Apple': 50,
    'Peaches': 75,
    'Oranges': 10,
    'Pears': 3
}
fruit_inv = pd.Series(fruit_inv_dict)
fruit_inv

Apple      50
Peaches    75
Oranges    10
Pears       3
dtype: int64
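
A short sketch of the "dictionary plus array" idea, using the same fruit inventory. The variable names other than fruit_inv are made up for this example.

```python
import pandas as pd

fruit_inv = pd.Series({"Apple": 50, "Peaches": 75, "Oranges": 10, "Pears": 3})

# Dictionary-style: look up one value by its key (label)
apples = fruit_inv["Apple"]

# Dictionary-style: the membership test checks the index labels
has_pears = "Pears" in fruit_inv

# Array-style: slice by label; unlike positional slices, both ends are included
first_three = fruit_inv["Apple":"Oranges"]

# Array-style: vectorized arithmetic applies to every value
doubled = fruit_inv * 2

print(apples, has_pears)
print(first_three)
print(doubled)
```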

<b>DataFrame</b> - two-dimensional array with explicit ROW and COLUMN INDICES<br>
It can be created from values, either given explicitly or generated randomly, with a defined index and columns<br>
<b>df = pd.DataFrame(values, index=index_name, columns=[list])</b>

In [77]:
#Here we will create another Series and then combine it with fruit_inv to create a df
fruit_color_dict = {
    'Apple': 'red',
    'Peaches': 'yellow blush',
    'Oranges': 'Orange',
    'Pears': 'green'
}
fruit_colors = pd.Series(fruit_color_dict)
fruit_colors

Apple               red
Peaches    yellow blush
Oranges          Orange
Pears             green
dtype: object

In [78]:
fruits = pd.DataFrame({'Inventory':fruit_inv, 'Color':fruit_colors})
fruits

Unnamed: 0,Inventory,Color
Apple,50,red
Peaches,75,yellow blush
Oranges,10,Orange
Pears,3,green


In [79]:
# Can also create a DataFrame using a dictionary
df_index = ['row1', 'row2', 'row3','row4']
df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        # this Series is indexed 0-3, which does not match df_index, so column C aligns to NaN
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }, index = df_index
)
df

Unnamed: 0,A,B,C,D,E,F
row1,1.0,2013-01-02,,3,test,foo
row2,1.0,2013-01-02,,3,train,foo
row3,1.0,2013-01-02,,3,test,foo
row4,1.0,2013-01-02,,3,train,foo
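
Column C is all NaN in the output above because pandas aligns a Series to the DataFrame index by label, and the labels 0 to 3 do not appear in df_index. The sketch below reproduces that behaviour and shows one possible fix; the names mismatched and aligned are made up for this example.

```python
import pandas as pd

df_index = ["row1", "row2", "row3", "row4"]

# Index 0..3 does not match df_index, so alignment finds no matching labels
mismatched = pd.DataFrame(
    {"C": pd.Series(1, index=list(range(4)), dtype="float32")},
    index=df_index,
)

# Building the Series on df_index lets the labels line up
aligned = pd.DataFrame(
    {"C": pd.Series(1, index=df_index, dtype="float32")},
    index=df_index,
)

print(mismatched["C"].isna().all())
print(aligned["C"].tolist())
```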


## Pull some basic info

In [80]:
#Pull some basic attributes:
print('DataFrame head')
print(df.head(2))
print('Tail')
print(df.tail(3))

DataFrame head
        A          B   C  D      E    F
row1  1.0 2013-01-02 NaN  3   test  foo
row2  1.0 2013-01-02 NaN  3  train  foo
Tail
        A          B   C  D      E    F
row2  1.0 2013-01-02 NaN  3  train  foo
row3  1.0 2013-01-02 NaN  3   test  foo
row4  1.0 2013-01-02 NaN  3  train  foo


In [81]:
df.index

Index(['row1', 'row2', 'row3', 'row4'], dtype='object')

In [82]:
df.columns

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

In [83]:
#return a np representation without index or column labels
df.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), nan, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), nan, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), nan, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), nan, 3, 'train', 'foo']],
      dtype=object)

NumPy arrays have one dtype for the entire array while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), <b> pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame.</b> If the common data type is object, DataFrame.to_numpy() will require copying data.
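
A small sketch of how the common dtype is chosen, using two made-up DataFrames (the column names are illustrative only): one with only numeric columns, one mixing numbers and text.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({"ints": [1, 2], "floats": [0.5, 1.5]})
mixed = pd.DataFrame({"ints": [1, 2], "text": ["a", "b"]})

numeric_arr = numeric.to_numpy()  # int64 and float64 share float64
mixed_arr = mixed.to_numpy()      # numbers and strings only share object

print(numeric_arr.dtype)
print(mixed_arr.dtype)
```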

In [84]:
# describe shows a quick statistical summary of the data
df.describe()

Unnamed: 0,A,B,C,D
count,4.0,4,0.0,4.0
mean,1.0,2013-01-02 00:00:00,,3.0
min,1.0,2013-01-02 00:00:00,,3.0
25%,1.0,2013-01-02 00:00:00,,3.0
50%,1.0,2013-01-02 00:00:00,,3.0
75%,1.0,2013-01-02 00:00:00,,3.0
max,1.0,2013-01-02 00:00:00,,3.0
std,0.0,,,0.0
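
Columns E and F are missing from the summary above because, by default, describe() leaves out object and categorical columns when numeric columns are present. A minimal sketch, assuming a cut-down two-column version of the DataFrame, of how include="all" brings them back:

```python
import pandas as pd

# Reduced stand-in for df: one numeric column and one text column
small = pd.DataFrame({"D": [3, 3, 3, 3], "F": ["foo"] * 4})

default_summary = small.describe()            # numeric columns only
full_summary = small.describe(include="all")  # every column

print(list(default_summary.columns))
print(list(full_summary.columns))
print(full_summary.loc["top", "F"])
```

For text columns the extra rows are count, unique, top and freq; the numeric statistics are NaN for those columns.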


In [85]:
# transpose the data
df.T

Unnamed: 0,row1,row2,row3,row4
A,1.0,1.0,1.0,1.0
B,2013-01-02 00:00:00,2013-01-02 00:00:00,2013-01-02 00:00:00,2013-01-02 00:00:00
C,,,,
D,3,3,3,3
E,test,train,test,train
F,foo,foo,foo,foo


In [86]:
#sort by index labels. axis=1 sorts the column labels (axis=0 would sort the row labels); ascending=False sorts in descending order
df.sort_index(axis=1, ascending=False)

Unnamed: 0,F,E,D,C,B,A
row1,foo,test,3,,2013-01-02,1.0
row2,foo,train,3,,2013-01-02,1.0
row3,foo,test,3,,2013-01-02,1.0
row4,foo,train,3,,2013-01-02,1.0


In [87]:
#sort by the values of the Inventory column
fruits.sort_values(by='Inventory')

Unnamed: 0,Inventory,Color
Pears,3,green
Oranges,10,Orange
Apple,50,red
Peaches,75,yellow blush
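
The two sorting methods can be contrasted on the fruits table. This sketch rebuilds the table so that it runs on its own; by_stock_desc and by_label are names invented for the example.

```python
import pandas as pd

fruits = pd.DataFrame(
    {
        "Inventory": [50, 75, 10, 3],
        "Color": ["red", "yellow blush", "Orange", "green"],
    },
    index=["Apple", "Peaches", "Oranges", "Pears"],
)

# sort_values orders rows by a column's values; ascending=False reverses it
by_stock_desc = fruits.sort_values(by="Inventory", ascending=False)

# sort_index orders rows by their labels instead
by_label = fruits.sort_index()

print(by_stock_desc)
print(by_label)
```

Both methods return a new DataFrame and leave fruits unchanged.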


## Selection methods for pandas
### Selection by labels: use df.loc[row_labels, column_labels]
.loc is an indexer, so it takes square brackets; the row and column arguments can be single labels, lists of labels, label slices, etc.

In [88]:
#passing a single label selects a column and yields a Series, equivalent to fruits.Inventory
fruits['Inventory']

Apple      50
Peaches    75
Oranges    10
Pears       3
Name: Inventory, dtype: int64

In [89]:
#passing a positional slice grabs the matching rows [x:y]  X IS INCLUSIVE, Y is EXCLUSIVE
#named first_three so the built-in slice() is not shadowed
first_three = df[0:3]
first_three.describe()

Unnamed: 0,A,B,C,D
count,3.0,3,0.0,3.0
mean,1.0,2013-01-02 00:00:00,,3.0
min,1.0,2013-01-02 00:00:00,,3.0
25%,1.0,2013-01-02 00:00:00,,3.0
50%,1.0,2013-01-02 00:00:00,,3.0
75%,1.0,2013-01-02 00:00:00,,3.0
max,1.0,2013-01-02 00:00:00,,3.0
std,0.0,,,0.0


In [90]:
#Select a row matching the label: here the second element of the df_index variable ('row2')
#.loc selects by label, and labels are case sensitive
#.iloc selects by integer position
df.loc[df_index[1]]

A                    1.0
B    2013-01-02 00:00:00
C                    NaN
D                      3
E                  train
F                    foo
Name: row2, dtype: object

In [91]:
#Select all rows with specific columns
#[:, ['E','F']] - the colon before the comma selects ALL rows
df.loc[:,['E','F']]

Unnamed: 0,E,F
row1,test,foo
row2,train,foo
row3,test,foo
row4,train,foo


In [95]:
#label slicing with both endpoints
# .loc is end INCLUSIVE
df.loc['row1':'row3', ['C','D','E','F']]

Unnamed: 0,C,D,E,F
row1,,3,test,foo
row2,,3,train,foo
row3,,3,test,foo


### Selection by position: df.iloc[row_positions, column_positions]

In [94]:
df.iloc[3]

A                    1.0
B    2013-01-02 00:00:00
C                    NaN
D                      3
E                  train
F                    foo
Name: row4, dtype: object

In [98]:
#integer slices behave like NumPy and Python slices (end EXCLUSIVE)
df.iloc[1:3, 0:2]

Unnamed: 0,A,B
row2,1.0,2013-01-02
row3,1.0,2013-01-02


In [104]:
#list of integers positions
df.iloc[[1,2], [0,2,4]]

Unnamed: 0,A,C,E
row2,1.0,,train
row3,1.0,,test


In [105]:
#slice rows explicitly (the trailing colon keeps all columns)
df.iloc[0:2,:]

Unnamed: 0,A,B,C,D,E,F
row1,1.0,2013-01-02,,3,test,foo
row2,1.0,2013-01-02,,3,train,foo


In [108]:
#slice columns explicitly (the leading colon keeps all rows)
df.iloc[:,3:]

Unnamed: 0,D,E,F
row1,3,test,foo
row2,3,train,foo
row3,3,test,foo
row4,3,train,foo


In [110]:
#get a specific value
df.iloc[3,4]

'train'
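
The same cell can be reached by position or by label. The sketch below rebuilds a simplified df (column C is filled with NaN directly, which is an assumption made to keep the example short) and checks that both routes agree.

```python
import numpy as np
import pandas as pd

df_index = ["row1", "row2", "row3", "row4"]
df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": np.nan,  # simplified stand-in for the all-NaN column
        "D": 3,
        "E": ["test", "train", "test", "train"],
        "F": "foo",
    },
    index=df_index,
)

by_position = df.iloc[3, 4]      # 4th row, 5th column (counting from 0)
by_label = df.loc["row4", "E"]   # the same cell, addressed by labels

print(by_position, by_label)
```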