# Intro to pandas (condensed)

This notebook follows the ["10 minutes to pandas"](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) tutorial from the pandas documentation.

Also see the [essential basic functionality](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics) common to the pandas data structures.

In [3]:
import numpy as np
import pandas as pd

Create a pandas Series:

In [4]:
s = pd.Series([1,2,-1,4], 
   index=['a','b','c','d'])
s

a    1
b    2
c   -1
d    4
dtype: int64

Create a pandas DataFrame:

In [1]:
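# Create an example DataFrame with three rows and four columns.
# Each row holds four values, so four column names are required.
# Run the import cell first; otherwise this raises "NameError: name 'pd' is not defined".
df = pd.DataFrame([[1, 1, 2, -3], [5, 6, 9, 8], [0, 4, 2, 2]], index=['a', 'b', 'c'], columns=['A', 'B', 'C', 'D'])
df

| | A | B | C | D |
|---|---|---|---|---|
| a | 1 | 1 | 2 | -3 |
| b | 5 | 6 | 9 | 8 |
| c | 0 | 4 | 2 | 2 |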

In [69]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
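
The construction of `df2` is not shown in this condensed notebook. A minimal sketch, assuming it was built from a dict of column values as in the "10 minutes to pandas" tutorial (the specific values below are illustrative assumptions), shows how each column ends up with its own dtype:

```python
import numpy as np
import pandas as pd

# Assumed construction of df2; the values are illustrative, not taken from this notebook.
df2 = pd.DataFrame({
    "A": 1.0,                                                  # scalar, broadcast to every row
    "B": pd.Timestamp("20130102"),                             # datetime scalar, broadcast
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),  # defines the four-row index
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",                                                # string scalar, broadcast
})

print(df2.dtypes)
```

The exact dtype reported for the datetime and string columns can differ between pandas versions, so the output may not match the listing above character for character.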

In [70]:
df2.A

0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64

In [71]:
# df2.<TAB>  # In IPython or Jupyter, press Tab after the dot to list the available attributes

### View data

In [72]:
df.head() # View top rows of DataFrame
df.tail(2) # View bottom two rows of DataFrame

| | A | B | C | D |
|---|---|---|---|---|
| 2018-01-05 | 1.221454 | -1.015721 | 0.294117 | -1.022751 |
| 2018-01-06 | -0.565945 | 1.723828 | 0.367676 | -1.059688 |
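
The date-indexed `df` used from this point on, and the `dates` index referenced in later cells, are not defined in this condensed notebook. A minimal sketch, assuming they were created as in the "10 minutes to pandas" tutorial with six days starting on 2018-01-01 and random values (so the numbers will differ from the outputs shown here):

```python
import numpy as np
import pandas as pd

# Assumed setup: six consecutive days as the row index.
dates = pd.date_range("20180101", periods=6)

# Assumed data: standard-normal random values in four columns, A to D.
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

print(df.head())   # first five rows (the default)
print(df.tail(2))  # last two rows
```

In a notebook only the last expression of a cell is displayed, which is why the cell above shows the `tail(2)` result and not the `head()` result.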


In [73]:
# Display the index and the columns (only the last expression in a cell is shown):
df.index  
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

Note that a fundamental difference between pandas and NumPy is that NumPy arrays have one dtype for the entire array, whereas pandas DataFrames have one dtype per column. 

When you call `DataFrame.to_numpy()`, pandas finds the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being `object`, which requires casting every value to a Python object.

In [74]:
df.to_numpy()

array([[-0.92743372, -0.65880372,  1.54449936,  0.38713799],
       [-0.85020442,  0.27710267,  1.76407743,  1.83489242],
       [ 0.12715449,  0.64734314, -0.8359118 ,  0.93972711],
       [ 0.22864553, -0.35752808,  0.44147455, -0.1586813 ],
       [ 1.22145381, -1.01572128,  0.29411731, -1.02275094],
       [-0.56594515,  1.72382766,  0.36767641, -1.0596884 ]])

`DataFrame.to_numpy()` does not include the index or column labels in the output.
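
The dtype rule described above can be checked on small hypothetical frames (the frames and column names here are made up for illustration):

```python
import pandas as pd

# All-float columns: the array keeps a numeric dtype.
numeric = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
print(numeric.to_numpy().dtype)   # float64

# Integer and float columns: upcast to the common numeric dtype.
int_and_float = pd.DataFrame({"n": [1, 2], "x": [0.5, 1.5]})
print(int_and_float.to_numpy().dtype)   # float64

# Float and string columns: the only common dtype is object.
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})
print(mixed.to_numpy().dtype)     # object
```

Because an `object` array stores one Python object per element, converting a mixed frame is typically much slower than converting a purely numeric one.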

In [75]:
# describe() shows a quick statistical summary of your data:
df.describe()

| | A | B | C | D |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | -0.127722 | 0.102703 | 0.595989 | 0.153439 |
| std | 0.820463 | 0.999767 | 0.946739 | 1.136017 |
| min | -0.927434 | -1.015721 | -0.835912 | -1.059688 |
| 25% | -0.77914 | -0.583485 | 0.312507 | -0.806734 |
| 50% | -0.219395 | -0.040213 | 0.404575 | 0.114228 |
| 75% | 0.203273 | 0.554783 | 1.268743 | 0.80158 |
| max | 1.221454 | 1.723828 | 1.764077 | 1.834892 |


In [76]:
# Transpose data:
df.T

| | 2018-01-01 00:00:00 | 2018-01-02 00:00:00 | 2018-01-03 00:00:00 | 2018-01-04 00:00:00 | 2018-01-05 00:00:00 | 2018-01-06 00:00:00 |
|---|---|---|---|---|---|---|
| A | -0.927434 | -0.850204 | 0.127154 | 0.228646 | 1.221454 | -0.565945 |
| B | -0.658804 | 0.277103 | 0.647343 | -0.357528 | -1.015721 | 1.723828 |
| C | 1.544499 | 1.764077 | -0.835912 | 0.441475 | 0.294117 | 0.367676 |
| D | 0.387138 | 1.834892 | 0.939727 | -0.158681 | -1.022751 | -1.059688 |


In [77]:
# Sorting by an axis
# axis=0 sorts the rows by their index labels; axis=1 sorts the columns by their column labels
df.sort_index(axis=1, ascending=False)

| | D | C | B | A |
|---|---|---|---|---|
| 2018-01-01 | 0.387138 | 1.544499 | -0.658804 | -0.927434 |
| 2018-01-02 | 1.834892 | 1.764077 | 0.277103 | -0.850204 |
| 2018-01-03 | 0.939727 | -0.835912 | 0.647343 | 0.127154 |
| 2018-01-04 | -0.158681 | 0.441475 | -0.357528 | 0.228646 |
| 2018-01-05 | -1.022751 | 0.294117 | -1.015721 | 1.221454 |
| 2018-01-06 | -1.059688 | 0.367676 | 1.723828 | -0.565945 |


In [78]:
# Sorting by values:
df.sort_values(by='A')

| | A | B | C | D |
|---|---|---|---|---|
| 2018-01-01 | -0.927434 | -0.658804 | 1.544499 | 0.387138 |
| 2018-01-02 | -0.850204 | 0.277103 | 1.764077 | 1.834892 |
| 2018-01-06 | -0.565945 | 1.723828 | 0.367676 | -1.059688 |
| 2018-01-03 | 0.127154 | 0.647343 | -0.835912 | 0.939727 |
| 2018-01-04 | 0.228646 | -0.357528 | 0.441475 | -0.158681 |
| 2018-01-05 | 1.221454 | -1.015721 | 0.294117 | -1.022751 |


In [79]:
s.sort_values()

e   -1.0
a    1.0
b    2.0
d    6.0
c    NaN
dtype: float64
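
The `s` sorted here is not the integer Series created at the top of the notebook: the outputs in this part of the notebook imply five values, one of them missing. A sketch reconstructed from those outputs (the constructor call itself is an assumption, since the original definition is not shown):

```python
import numpy as np
import pandas as pd

# Reconstructed from the displayed outputs; the original definition is not shown.
s = pd.Series([1, 2, np.nan, 6, -1], index=["a", "b", "c", "d", "e"])

# sort_values() places NaN last by default (na_position="last").
print(s.sort_values())

# describe() skips NaN, so count is 4 rather than 5.
print(s.describe())
```

Passing `na_position="first"` to `sort_values()` would put the missing entry at the top instead.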

In [80]:
s.describe()

count    4.00000
mean     2.00000
std      2.94392
min     -1.00000
25%      0.50000
50%      1.50000
75%      3.00000
max      6.00000
dtype: float64

In [81]:
s.to_numpy()

array([ 1.,  2., nan,  6., -1.])

In [82]:
s.tail(2)

d    6.0
e   -1.0
dtype: float64

### Selecting

In [83]:
# Select a single column (which yields a Series):
df['A']

# Select via [], which slices rows
df[0:2]

| | A | B | C | D |
|---|---|---|---|---|
| 2018-01-01 | -0.927434 | -0.658804 | 1.544499 | 0.387138 |
| 2018-01-02 | -0.850204 | 0.277103 | 1.764077 | 1.834892 |


#### Select by label:

In [84]:
# Get row by label:
df.loc[dates[0]]

# Get multi-axis by label:
df.loc[:, ['A', 'B']]

# Slicing with the label:
df.loc['20180102':'20180104', ['A', 'B']]

# Get a scalar value:
df.loc[dates[0], 'A']

-0.9274337156004528

#### Select by position:

In [85]:
# Select via a passed integer:
df.iloc[3]

# Slice integers:
df.iloc[3:5, 0:3]

# Use lists of integer position locations:
df.iloc[[2, 4, 5], [0, 2]]

# Slice rows:
df.iloc[1:4, :]

# Slice columns:
df.iloc[:, 0:2]

# Get specific value:
df.iloc[1,2]

1.7640774288429184
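
One difference between the two selection styles is easy to miss: label slices with `.loc` include both endpoints, while position slices with `.iloc` exclude the stop position, as in ordinary Python slicing. A small sketch on a hypothetical frame with the same shape and index as `df` (the integer values are chosen only to make the result easy to read):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20180101", periods=6)
# Hypothetical data: the integers 0..23 laid out in six rows and four columns.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label slice: both 2018-01-02 and 2018-01-04 are included.
by_label = df.loc["20180102":"20180104", ["A", "B"]]
print(by_label.shape)      # (3, 2)

# Position slice: row 3 and column 2 are excluded.
by_position = df.iloc[1:3, 0:2]
print(by_position.shape)   # (2, 2)

# The same scalar reached by label and by position.
print(df.loc[dates[0], "A"], df.iloc[0, 0])
```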

#### Boolean indexing

In [86]:
# Use a single column’s values to select data
df[df.A > 0]

| | A | B | C | D |
|---|---|---|---|---|
| 2018-01-03 | 0.127154 | 0.647343 | -0.835912 | 0.939727 |
| 2018-01-04 | 0.228646 | -0.357528 | 0.441475 | -0.158681 |
| 2018-01-05 | 1.221454 | -1.015721 | 0.294117 | -1.022751 |


In [87]:
s[s > 0]

a    1.0
b    2.0
d    6.0
dtype: float64

In [88]:
# Select values from DataFrame where a boolean condition is met
df[df > 0]

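| | A | B | C | D |
|---|---|---|---|---|
| 2018-01-01 | NaN | NaN | 1.544499 | 0.387138 |
| 2018-01-02 | NaN | 0.277103 | 1.764077 | 1.834892 |
| 2018-01-03 | 0.127154 | 0.647343 | NaN | 0.939727 |
| 2018-01-04 | 0.228646 | NaN | 0.441475 | NaN |
| 2018-01-05 | 1.221454 | NaN | 0.294117 | NaN |
| 2018-01-06 | NaN | 1.723828 | 0.367676 | NaN |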
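
The two boolean forms behave differently: a boolean Series such as `df.A > 0` filters rows, whereas a boolean DataFrame such as `df > 0` keeps the shape and replaces every non-matching cell with NaN. A short sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical data with one negative value per column.
df = pd.DataFrame({"A": [-1.0, 2.0, 3.0], "B": [4.0, -5.0, 6.0]})

# Boolean DataFrame: same shape, NaN where the condition is False.
positive = df[df > 0]
print(positive)

# This is the same result as DataFrame.where().
print(positive.equals(df.where(df > 0)))   # True

# Boolean Series: whole rows are kept or dropped.
print(df[df["A"] > 0])
```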


In [89]:
# Use the isin() method to filter:
df3 = df.copy()
df3['E'] = ['one', 'two', 'one', 'three', 'five', 'two']
df3['E'].isin(['one', 'two'])  # Boolean mask: True where E is 'one' or 'two'
df3[df3['E'].isin(['one', 'two'])]


| | A | B | C | D | E |
|---|---|---|---|---|---|
| 2018-01-01 | -0.927434 | -0.658804 | 1.544499 | 0.387138 | one |
| 2018-01-02 | -0.850204 | 0.277103 | 1.764077 | 1.834892 | two |
| 2018-01-03 | 0.127154 | 0.647343 | -0.835912 | 0.939727 | one |
| 2018-01-06 | -0.565945 | 1.723828 | 0.367676 | -1.059688 | two |


In [90]:
df3.loc['20180101':'20180103', ['A', 'B', 'E']].isin(['one'])

| | A | B | E |
|---|---|---|---|
| 2018-01-01 | False | False | True |
| 2018-01-02 | False | False | False |
| 2018-01-03 | False | False | True |


### Setting

In [91]:
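# Setting a new column aligns the data automatically by index.
# Here s1 is indexed by 2013 dates, which share no labels with df3's 2018 index, so every value in F is NaN.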
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130101', periods=6))
s1
df3['F'] = s1 
df3

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 2018-01-01 | -0.927434 | -0.658804 | 1.544499 | 0.387138 | one | NaN |
| 2018-01-02 | -0.850204 | 0.277103 | 1.764077 | 1.834892 | two | NaN |
| 2018-01-03 | 0.127154 | 0.647343 | -0.835912 | 0.939727 | one | NaN |
| 2018-01-04 | 0.228646 | -0.357528 | 0.441475 | -0.158681 | three | NaN |
| 2018-01-05 | 1.221454 | -1.015721 | 0.294117 | -1.022751 | five | NaN |
| 2018-01-06 | -0.565945 | 1.723828 | 0.367676 | -1.059688 | two | NaN |
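
Column `F` is empty because assignment aligns on index labels, and the Series assigned here has no dates in common with the frame. A sketch of the overlapping case, assuming a hypothetical Series that starts one day after the frame's first date:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20180101", periods=6)
# Hypothetical one-column frame, used only to show the alignment.
df = pd.DataFrame({"A": np.arange(6.0)}, index=dates)

# Hypothetical Series covering 2018-01-02 through 2018-01-07.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20180102", periods=6))

# Values are matched by date: 2018-01-01 has no match and becomes NaN,
# and the value for 2018-01-07 is dropped because df has no such row.
df["F"] = s1
print(df)
```

The new column is stored as float because NaN cannot be held in a plain integer column.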


In [92]:
# Setting by assigning a NumPy array
df3.loc[:, 'D'] = np.array([5] * len(df3)) 
df3

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 2018-01-01 | -0.927434 | -0.658804 | 1.544499 | 5 | one | NaN |
| 2018-01-02 | -0.850204 | 0.277103 | 1.764077 | 5 | two | NaN |
| 2018-01-03 | 0.127154 | 0.647343 | -0.835912 | 5 | one | NaN |
| 2018-01-04 | 0.228646 | -0.357528 | 0.441475 | 5 | three | NaN |
| 2018-01-05 | 1.221454 | -1.015721 | 0.294117 | 5 | five | NaN |
| 2018-01-06 | -0.565945 | 1.723828 | 0.367676 | 5 | two | NaN |


### Missing Data
For more information, see the [missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data) section of the user guide.

Note that pandas generally uses the value `np.nan` to represent missing data.

In [99]:
# Reindexing lets you change, add, or delete labels on a specified axis; it returns a copy of the data.
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1 # Set column E to 1 for the first two dates; the other two stay NaN

# Drop rows with missing data
df1.dropna(how='any')

# Filling missing data
df1.fillna(value=4)

# Get the boolean mask where values are nan
pd.isna(df1)

| | A | B | C | D | E |
|---|---|---|---|---|---|
| 2018-01-01 | False | False | False | False | False |
| 2018-01-02 | False | False | False | False | False |
| 2018-01-03 | False | False | False | False | True |
| 2018-01-04 | False | False | False | False | True |
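
Only the last expression of the cell above is displayed, so the effects of `dropna` and `fillna` are not visible. A sketch on a small hypothetical frame with the same missing-value pattern in column `E`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column E is missing in the last two rows.
df1 = pd.DataFrame({"A": [0.1, 0.2, 0.3, 0.4], "E": [1.0, 1.0, np.nan, np.nan]})

dropped = df1.dropna(how="any")   # keeps only the rows with no missing value
filled = df1.fillna(value=4)      # replaces every NaN with 4
mask = pd.isna(df1)               # True where a value is missing

print(dropped)
print(filled)
print(mask)

# Both dropna() and fillna() return new objects; df1 itself is unchanged.
print(df1["E"].isna().sum())
```

To keep a result, assign it to a name, for example `df1 = df1.fillna(value=4)`.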


### Statistics