# Introduction to pandas
## Importing modules
When using pandas, import:

    import pandas as pd
    import numpy as np

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

## Two data structure types: (1) Series and (2) Data Frames
Series are used for one-dimensional data (like a 1D array). Data frames are used for two-dimensional, tabular data (like a 2D ndarray, but with labelled rows and columns that can hold different types).

## Series:
A series is composed of two arrays associated with each other. The main array holds the data, the **Values**; each element is associated with a label held in the other array, called the **Index**. It looks something like this:

| Index | Value |
|-------|-------|
| 0 | 4 |
| 1 | 1 |
| 2 | 8 |
| 3 | 3 |

### To create a Series, call the `pd.Series()` constructor and pass in a list or array containing the values. For example:

In [4]:
MyFirstSeries=pd.Series([4,1,8,3])
print(MyFirstSeries)

0    4
1    1
2    8
3    3
dtype: int64


### By default, the index holds integers counting up from 0 (the left-hand column in the output of `MyFirstSeries`). Meaningful labels are often more useful: pass them as a list or array through the `index` argument.

In [6]:
MyFirstSeries=pd.Series([4,1,8,3], index=['A','B','C','D'])
print(MyFirstSeries)

A    4
B    1
C    8
D    3
dtype: int64


### You can access the `index` and `values` attributes separately.

In [9]:
MyFirstSeries.values

array([4, 1, 8, 3])

In [10]:
MyFirstSeries.index

Index(['A', 'B', 'C', 'D'], dtype='object')

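### Select elements of a pandas Series by position, as in a NumPy array, OR by specifying the label.
With a non-integer index like this one, use `.iloc` for positions: recent pandas releases deprecate plain `MyFirstSeries[2]` as a positional lookup.

In [13]:
print(MyFirstSeries.iloc[2])
print(MyFirstSeries['C'])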

8
8


#### You can also select multiple items.

In [14]:
print(MyFirstSeries[0:2])
print(MyFirstSeries[['B','C']])

A    4
B    1
dtype: int64
B    1
C    8
dtype: int64
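
The two selection styles above can be spelled out explicitly with `.iloc` (position) and `.loc` (label). The following is a minimal sketch, assuming the same example values; the variable names are illustrative only.

```python
import pandas as pd

# Same example values as above, with string labels as the index.
s = pd.Series([4, 1, 8, 3], index=['A', 'B', 'C', 'D'])

by_position = s.iloc[2]   # third element, counting from 0
by_label = s.loc['C']     # element whose label is 'C'

# Position slices exclude the end point; label slices include it.
first_two = s.iloc[0:2]   # positions 0 and 1 (labels A and B)
a_to_c = s.loc['A':'C']   # labels A, B and C

print(by_position, by_label)
print(first_two)
print(a_to_c)
```

Being explicit about `.iloc` versus `.loc` avoids ambiguity when the index itself contains numbers.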


### Similarly, you can assign values to an element in a Series, by position with `.iloc` or by label.

In [21]:
MyFirstSeries.iloc[1]=6
print("MyFirstSeries.iloc[1]=6:\n",MyFirstSeries)
MyFirstSeries['C']=6
print("MyFirstSeries['C']=6\n",MyFirstSeries)

MyFirstSeries.iloc[1]=6:
 A    4
B    6
C    8
D    3
dtype: int64
MyFirstSeries['C']=6
 A    4
B    6
C    6
D    3
dtype: int64


### Defining `Series` from NumPy Arrays and other Series

In [22]:
arr=np.array([1,2,3,4,5])
series2=pd.Series(arr)
series2

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [23]:
series3=pd.Series(series2)
series3

0    1
1    2
2    3
3    4
4    5
dtype: int64

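#### Remember! In the pandas version used here, the values from the original NumPy array or Series are NOT copied; the new Series shares the same underlying data. If the original object's values change, the new Series sees those changes too. To avoid this, make an explicit copy, for example with `.copy()` or by passing `copy=True` to the constructor. (Releases with Copy-on-Write enabled, which is the default from pandas 3.0, no longer propagate such changes.)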

In [24]:
series2[2]=0
series3

0    1
1    2
2    0
3    4
4    5
dtype: int64
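
To get a Series that does not follow later changes to its source, one option is an explicit copy. This is a minimal sketch using `Series.copy()`; the names `original` and `independent` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4, 5])
original = pd.Series(arr)

# .copy() returns a Series with its own data, independent of `original`.
independent = original.copy()

# Modify the original only.
original.iloc[2] = 0

print(original.iloc[2])      # the modified value
print(independent.iloc[2])   # still the value from before the change
```

Passing `copy=True` to `pd.Series(...)` is an alternative when building the Series directly from an array.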

### Filtering values.
Use a comparison to build a boolean mask, then index the Series with that mask to keep only the matching values.

In [25]:
series2>2

0    False
1    False
2    False
3     True
4     True
dtype: bool

In [26]:
series2[series2>2]

3    4
4    5
dtype: int64
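
Several conditions can be combined into one mask. The sketch below assumes the same values as `series2` after the edit above and uses the element-wise operators `&` (and), `|` (or) and `~` (not); the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 0, 4, 5])

# Each comparison needs its own parentheses, because & and | bind
# more tightly than the comparison operators.
between = s[(s > 0) & (s < 5)]      # strictly between 0 and 5
extremes = s[(s == 0) | (s == 5)]   # equal to 0 or to 5
not_large = s[~(s > 2)]             # everything that is not > 2

print(between)
print(extremes)
print(not_large)
```

The Python keywords `and`, `or` and `not` do not work element-wise on a Series, which is why the symbolic operators are used here.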

### You can also do basic math with Series.

In [27]:
series2/2

0    0.5
1    1.0
2    0.0
3    2.0
4    2.5
dtype: float64

In [29]:
np.sin(series2)

0    0.841471
1    0.909297
2    0.000000
3   -0.756802
4   -0.958924
dtype: float64

### Some ways to inspect the values: unique values, value counts, and lookups by a repeated label.

In [30]:
serd=pd.Series([1,0,2,1,2,3],index=['white','white','blue','green','green','yellow'])
serd

white     1
white     0
blue      2
green     1
green     2
yellow    3
dtype: int64

In [31]:
serd.unique()

array([1, 0, 2, 3])

In [32]:
serd.value_counts()

2    2
1    2
3    1
0    1
dtype: int64

In [33]:
serd['white']

white    1
white    0
dtype: int64
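
Because the index contains repeated labels, a label lookup can return either a single value or a Series. A small sketch, assuming the same `serd` as above, that checks for repeats with `Index.is_unique` first:

```python
import pandas as pd

serd = pd.Series([1, 0, 2, 1, 2, 3],
                 index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

# False here, because 'white' and 'green' each appear twice.
has_unique_labels = serd.index.is_unique

blue = serd['blue']     # label appears once: a single value
white = serd['white']   # label appears twice: a two-element Series

print(has_unique_labels)
print(blue)
print(white)
```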

### Keeping only the first of several entries that share an index label
For example, my position data output files sometimes contain duplicated lines due to multithreading. They are easy to remove with this syntax:

In [36]:
a=serd[~serd.index.duplicated(keep='first')]
a

white     1
blue      2
green     1
yellow    3
dtype: int64
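
It can help to look at the intermediate mask that `Index.duplicated` returns. This sketch assumes the same `serd` as above and also shows the `keep='last'` variant for comparison; the variable names are illustrative.

```python
import pandas as pd

serd = pd.Series([1, 0, 2, 1, 2, 3],
                 index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

# True marks a label that has already appeared earlier in the index.
mask = serd.index.duplicated(keep='first')

keep_first = serd[~mask]                                  # first occurrence of each label
keep_last = serd[~serd.index.duplicated(keep='last')]     # last occurrence of each label

print(mask)
print(keep_first)
print(keep_last)
```

The `~` operator inverts the mask, so the selection keeps every entry that is not flagged as a duplicate.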

## Data frames:
A data frame is a tabular data structure, like a spreadsheet.

| Index | Value | Color | Time |
|-------|:-------:|:-------:|------|
| 0 | 4 | yellow| 0.01 |
| 1 | 1 | blue | 0.02 |
| 2 | 8 | green |0.03 |
| 3 | 3 | red   |0.04 |

A data frame has 2 index arrays. The first is associated with the rows (like in a Series). The second is associated with the columns. In the above example, the first index array would be [0,1,2,3] and the second would be ['Value','Color','Time'].

### Defining data frames:

A common way to create a new DataFrame is to pass in a dict whose keys become the column names.
When creating a data frame, you can also keep only a subset of the columns (with `columns`) and set the row labels (with `index`).

In [39]:
data={'Value':[4,1,8,3],'Color':['yellow','blue','green','red'],'Time':[0.01,0.02,0.03,0.04]}
frame1=pd.DataFrame(data)
frame1

|   | Color | Time | Value |
|---|-------|------|-------|
| 0 | yellow | 0.01 | 4 |
| 1 | blue | 0.02 | 1 |
| 2 | green | 0.03 | 8 |
| 3 | red | 0.04 | 3 |


In [40]:
frame2=pd.DataFrame(data,columns=['Time','Value'])
frame2

|   | Time | Value |
|---|------|-------|
| 0 | 0.01 | 4 |
| 1 | 0.02 | 1 |
| 2 | 0.03 | 8 |
| 3 | 0.04 | 3 |


In [49]:
frame3=pd.DataFrame(data,index=np.arange(0,0.4,0.1))
frame3

|     | Color | Time | Value |
|-----|-------|------|-------|
| 0.0 | yellow | 0.01 | 4 |
| 0.1 | blue | 0.02 | 1 |
| 0.2 | green | 0.03 | 8 |
| 0.3 | red | 0.04 | 3 |


In [52]:
frame4=pd.DataFrame(np.arange(16).reshape((4,4)),columns=['x','y','z','time'],index=np.arange(0,0.4,0.1))
frame4

|     | x | y | z | time |
|-----|---|---|---|------|
| 0.0 | 0 | 1 | 2 | 3 |
| 0.1 | 4 | 5 | 6 | 7 |
| 0.2 | 8 | 9 | 10 | 11 |
| 0.3 | 12 | 13 | 14 | 15 |


### Selecting elements of a pandas DataFrame

http://pandas.pydata.org/pandas-docs/stable/indexing.html

In [53]:
frame4.index

Float64Index([0.0, 0.1, 0.2, 0.3], dtype='float64')

In [54]:
frame4.columns

Index(['x', 'y', 'z', 'time'], dtype='object')

In [55]:
frame4.values

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

### Accessing Columns of a data frame. 

In [56]:
frame4['x']

0.0     0
0.1     4
0.2     8
0.3    12
Name: x, dtype: int64

In [57]:
frame4.x

0.0     0
0.1     4
0.2     8
0.3    12
Name: x, dtype: int64
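
More than one column can be selected by passing a list of labels. The sketch below rebuilds `frame4` with the index written out literally (an assumption made here to keep the example self-contained) and contrasts the two forms; the names `single` and `subset` are illustrative.

```python
import numpy as np
import pandas as pd

frame4 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      columns=['x', 'y', 'z', 'time'],
                      index=[0.0, 0.1, 0.2, 0.3])

single = frame4['x']             # one label: returns a Series
subset = frame4[['x', 'time']]   # list of labels: returns a DataFrame

print(type(single).__name__)
print(type(subset).__name__)
print(subset)
```

Attribute access such as `frame4.x` only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so the bracket form is the safer default.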

### Accessing Rows of a DataFrame
Use `iloc` to select rows by integer position, counting from 0.

In [61]:
frame4.iloc[1]

x       4
y       5
z       6
time    7
Name: 0.1, dtype: int64

In [64]:
frame4.iloc[[2,3]]  #Note the double brackets here: the inner list selects the rows at positions 2 and 3. With one set of brackets, frame4.iloc[2,3] selects a single element: row position 2, column position 3.

|     | x | y | z | time |
|-----|---|---|---|------|
| 0.2 | 8 | 9 | 10 | 11 |
| 0.3 | 12 | 13 | 14 | 15 |


In [75]:
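frame4[:2]  #In the pandas version that produced this output, an integer slice on this float index was read as "labels up to 2", so every row is returned. Use frame4.iloc[:2] for the first two rows by position.

|     | x | y | z | time |
|-----|---|---|---|------|
| 0.0 | 0 | 1 | 2 | 3 |
| 0.1 | 4 | 5 | 6 | 7 |
| 0.2 | 8 | 9 | 10 | 11 |
| 0.3 | 12 | 13 | 14 | 15 |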
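
Since the meaning of a plain slice depends on the index type, it is usually clearer to state whether rows are chosen by position or by label. A minimal sketch, assuming `frame4` rebuilt with its index written out literally; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame4 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      columns=['x', 'y', 'z', 'time'],
                      index=[0.0, 0.1, 0.2, 0.3])

first_two = frame4.iloc[:2]         # rows at positions 0 and 1
by_label = frame4.loc[0.1]          # the row whose label is 0.1
label_range = frame4.loc[0.1:0.2]   # label slice, both ends included
cell = frame4.iloc[2, 3]            # row position 2, column position 3

print(first_two)
print(by_label)
print(label_range)
print(cell)
```

`.iloc` always counts positions and `.loc` always matches labels, so neither depends on how a particular pandas version interprets a bare slice.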
