# Intro to pandas Data Structures

Let's work through the pandas documentation. It might be best to begin with the introduction to [data structures](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html).

In [181]:
import numpy as np
import pandas as pd

%matplotlib inline 

## Series
A Series is a one-dimensional labeled array that can hold any data type.

#### To create a Series, call the constructor:
    s = pd.Series(data, index=index)

In [182]:
data = np.random.randn(5)
index = ['a', 'b', 'c', 'd', 'e']

s = pd.Series(data, index=index)
s

a   -1.352094
b   -1.749655
c   -0.029968
d   -0.602417
e   -0.664488
dtype: float64

In [183]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [184]:
s.keys()

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [185]:
# Series can be instantiated from dicts:
d = {'a': 1, 'b': 2, 'c': 3}
s_dict = pd.Series(d)
print(s_dict)
#s_dict.index

a    1
b    2
c    3
dtype: int64
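
When a Series is built from a dict, an explicit `index` can also be passed to pick and order the labels. The following is a minimal sketch; the name `s_from_dict` and the label `'z'` are made up for illustration. Labels that are not keys in the dict come out as NaN.

```python
import pandas as pd

d = {'a': 1, 'b': 2, 'c': 3}

# Ask for labels in a custom order; 'z' is not a key in the dict,
# so its value is missing (NaN) and the dtype becomes float64.
s_from_dict = pd.Series(d, index=['c', 'z', 'a'])
print(s_from_dict)
```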


#### A Series acts much like an ndarray and is a valid argument to most NumPy functions. However, operations such as slicing also slice the index:

In [186]:
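s.iloc[0]          # first element by position (recent pandas versions deprecate s[0] on a label-indexed Series)
s[:3]              # first three elements, together with their labels
s[s > s.median()]  # boolean mask: values above the median (only this last expression is displayed)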


c   -0.029968
d   -0.602417
dtype: float64

In [187]:
s.iloc[[4, 3, 1]]  # select by position, in the given order

e   -0.664488
d   -0.602417
b   -1.749655
dtype: float64

In [188]:
# Calculate the exponential of all elements in the input array, which in this case is a Series
np.exp(s)

a    0.258698
b    0.173834
c    0.970477
d    0.547487
e    0.514537
dtype: float64

In [189]:
s.dtype

dtype('float64')

If you want a NumPy array representing the values in a Series or Index, use one of these:

-    `Index.to_numpy()`
-    `Series.to_numpy()` (use this if you need an actual ndarray)

In [190]:
s.to_numpy()

array([-1.35209378, -1.74965475, -0.02996787, -0.60241745, -0.66448848])

#### A Series is like a fixed-size dict because you can get the values by using the index label:

In [191]:
# Get the value at label 'a'
s['a']

# Or, use the get method:
#s.get('a')

-1.3520937792885703

In [192]:
# Assign by label: 'e' already exists, so its value is overwritten (a new label would be appended)
s['e'] = 12

In [193]:
s

a    -1.352094
b    -1.749655
c    -0.029968
d    -0.602417
e    12.000000
dtype: float64

In [194]:
'e' in s

True

In [195]:
'f' in s

False

In [196]:
# If a label is not in the index, a KeyError is raised, for example:
# s['f']
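
The difference between bracket lookup and `get` for a missing label can be sketched as follows. The Series `s_demo` and its values are invented for this example; `get` returns `None`, or a supplied default, instead of raising.

```python
import pandas as pd

s_demo = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

# Bracket lookup raises KeyError for a label that is not in the index
try:
    s_demo['f']
    lookup_failed = False
except KeyError:
    lookup_failed = True

# get() returns None (or the given default) instead of raising
missing = s_demo.get('f')
fallback = s_demo.get('f', 0.0)

print(lookup_failed, missing, fallback)
```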

#### Vectorized operations and label alignment with Series

A Series can be passed to most NumPy functions that expect an ndarray. With raw NumPy arrays, looping value by value is usually unnecessary; the same holds for Series in pandas.

In [197]:
s + s

a    -2.704188
b    -3.499309
c    -0.059936
d    -1.204835
e    24.000000
dtype: float64

In [198]:
s * 2

a    -2.704188
b    -3.499309
c    -0.059936
d    -1.204835
e    24.000000
dtype: float64
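
To make the point about avoiding loops concrete, here is a small sketch that compares an explicit loop with the vectorized form. The Series `s_demo` and the arithmetic are arbitrary choices for illustration.

```python
import pandas as pd

s_demo = pd.Series([0.0, 1.0, 2.0], index=['a', 'b', 'c'])

# Element-by-element loop: works, but is verbose and slow on large data
looped = pd.Series({label: value * 2 + 1 for label, value in s_demo.items()})

# Vectorized equivalent: the arithmetic is applied to every element at once
vectorized = s_demo * 2 + 1

print(looped.equals(vectorized))
```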

Series automatically align data by label, so you can write computations without checking whether the Series involved have the same labels.

In [199]:
s[1:] + s[:-1]

a         NaN
b   -3.499309
c   -0.059936
d   -1.204835
e         NaN
dtype: float64

The result of an operation between unaligned Series has the union of the indexes involved. If a label is not found in one Series or the other, the result for that label is marked as missing (NaN).

You can drop labels with missing data using the `dropna` method.
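
A small sketch of alignment between two Series with different labels, followed by `dropna`. The Series `left` and `right` are made-up examples.

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# The result is indexed by the union of labels; 'a' and 'd' appear in
# only one operand, so their results are NaN.
total = left + right

# dropna returns a new Series without the missing entries
clean = total.dropna()

print(total)
print(clean)
```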

#### Series can also have a `name` attribute
An attribute is a value attached to an object and accessed with dot notation, for example `s2.name`.

In [200]:
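# `data` is the array used to build `s` above. The last value shows as 12.0 in this run,
# most likely because `s` shared memory with `data`, so `s['e'] = 12` also changed the array.
s2 = pd.Series(data, name='something')
s2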

0    -1.352094
1    -1.749655
2    -0.029968
3    -0.602417
4    12.000000
Name: something, dtype: float64
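
Reading and changing the name can be sketched as follows. The names `s_named` and `renamed` are illustrative; `rename` with a scalar returns a new Series and leaves the original name untouched.

```python
import numpy as np
import pandas as pd

s_named = pd.Series(np.arange(3), name='something')

# Read the attribute with dot notation
print(s_named.name)

# rename() with a scalar returns a new Series carrying the new name
renamed = s_named.rename('different')
print(renamed.name)

# The name becomes the column label when the Series is turned into a DataFrame
print(s_named.to_frame().columns.tolist())
```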

## DataFrame
A DataFrame is a 2-D labeled data structure that can have columns with different types. Similar to Series, DataFrame accepts many different kinds of input:
-    Dict of 1-D ndarrays, lists, dicts, or Series
-    2-D numpy.ndarray
-    Structured or record ndarray
-    A Series
-    Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments.

#### To create a DataFrame, call the constructor:
    df = pd.DataFrame(data, index=index, columns=columns)
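
Two of the input kinds listed above, a 2-D ndarray and a Series, can be sketched like this. All names and values here (`arr`, `ser`, the labels) are invented for illustration.

```python
import numpy as np
import pandas as pd

# From a 2-D ndarray, with explicit row and column labels
arr = np.arange(6).reshape(2, 3)
df_from_array = pd.DataFrame(arr, index=['r1', 'r2'], columns=['x', 'y', 'z'])

# From a Series: its index becomes the row index and its name the column label
ser = pd.Series([1.0, 2.0], index=['r1', 'r2'], name='value')
df_from_series = pd.DataFrame(ser)

print(df_from_array)
print(df_from_series)
```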
    

In [201]:
# Create a DataFrame from a list of rows, with row and column labels
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=['dog', 'cat', 'cow'], columns=['a', 'b', 'c'])

# Add columns to the DataFrame
df['d'] = [10, 11, 12]  # using square-bracket notation
df.loc[:, 'e'] = [13, 14, 15]  # using the loc indexer
df['a_trunc'] = df['a'][:2]  # from a Series: it is aligned to the DataFrame's index, and labels it lacks become NaN

# Add a row to the DataFrame using the loc indexer
df.loc['pig'] = [10, 10, 15, 20, 30, 40]

# Note: getting, setting, and deleting columns uses the same syntax as the analogous dict operations
# Get a column from the DataFrame
df['a']

# Delete columns from the DataFrame
del df['d']  # delete a column in place
df.pop('b')  # remove a column in place and return it as a Series

# Delete a row by label; drop returns a new DataFrame and leaves df unchanged
df.drop(labels='cat')
# Delete a row by position
#df.drop(df.index[0])


|   | a | c | e | a_trunc |
|---|---|---|---|---|
| dog | 1 | 3 | 13 | 1.0 |
| cow | 7 | 9 | 15 | NaN |
| pig | 10 | 15 | 30 | 40.0 |
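
Unlike `del` and `pop`, `drop` does not modify the DataFrame it is called on. A minimal sketch, using a small made-up frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                       index=['dog', 'cat', 'cow'], columns=['a', 'b'])

# drop returns a new DataFrame; the original is unchanged
dropped = df_demo.drop(labels='cat')
still_there = 'cat' in df_demo.index

# Reassign the result to keep the change
df_demo = df_demo.drop(index='cat')

# The same method removes columns
no_b = df_demo.drop(columns='b')

print(dropped.index.tolist(), still_there, no_b.columns.tolist())
```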


### From dict of Series or dicts

In [202]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [203]:
df = pd.DataFrame(d)
df

|   | one | two |
|---|---|---|
| a | 1.0 | 1.0 |
| b | 2.0 | 2.0 |
| c | 3.0 | 3.0 |
| d | NaN | 4.0 |


In [204]:
pd.DataFrame(d, index=['d', 'b', 'a'])

|   | one | two |
|---|---|---|
| d | NaN | 4.0 |
| b | 2.0 | 2.0 |
| a | 1.0 | 1.0 |


In [205]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

|   | two | three |
|---|---|---|
| d | 4.0 | NaN |
| b | 2.0 | NaN |
| a | 1.0 | NaN |


In [206]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [207]:
df.columns

Index(['one', 'two'], dtype='object')

### From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the index will be `range(n)`, where `n` is the array length.

In [208]:
# Named d2 rather than dict so the built-in dict type is not shadowed
d2 = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
pd.DataFrame(d2, index=['a', 'b', 'c', 'd'])

|   | one | two |
|---|---|---|
| a | 1.0 | 4.0 |
| b | 2.0 | 3.0 |
| c | 3.0 | 2.0 |
| d | 4.0 | 1.0 |
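
The two rules stated above, the default integer index and the equal-length requirement, can be checked with a short sketch. The names `columns_data`, `df_default`, and `mismatch_raised` are illustrative.

```python
import pandas as pd

columns_data = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}

# With no index argument, the rows are labeled 0..n-1
df_default = pd.DataFrame(columns_data)
print(df_default.index.tolist())

# Lists of different lengths are rejected with a ValueError
try:
    pd.DataFrame({'one': [1., 2., 3.], 'two': [4., 3.]})
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```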


### From structured or record array

In [209]:
#data = np.zeros((2, 3), dtype=float)
# Each field is declared as a (name, dtype) pair: 'i4' is a 32-bit integer, 'f4' a 32-bit float, 'S10' a 10-byte string
data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
#print(data)
data[:] = [(1, 2., 'Hello'), (2, 3., "World")]
pd.DataFrame(data)



|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 2.0 | b'Hello' |
| 1 | 2 | 3.0 | b'World' |


In [210]:
pd.DataFrame(data, index=['first', 'second'])

|   | A | B | C |
|---|---|---|---|
| first | 1 | 2.0 | b'Hello' |
| second | 2 | 3.0 | b'World' |


In [211]:
pd.DataFrame(data, columns=['C', 'A', 'B'])

|   | C | A | B |
|---|---|---|---|
| 0 | b'Hello' | 1 | 2.0 |
| 1 | b'World' | 2 | 3.0 |
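
The `b'...'` values in column `C` are bytes, because the structured array stores fixed-width byte strings. If regular text is wanted, one option is to decode the column after building the DataFrame. This is a sketch; the names `records` and `df_records` and the UTF-8 encoding are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Structured array with an int field, a float field, and a 10-byte string field
records = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
records[:] = [(1, 2., b'Hello'), (2, 3., b'World')]

df_records = pd.DataFrame(records)

# Decode the bytes in column C into ordinary Python strings
df_records['C'] = df_records['C'].str.decode('utf-8')
print(df_records)
```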


In [212]:
# Create a MultiIndexed DataFrame by passing a dict whose keys are tuples
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 4},
             ('a', 'a'): {('A', 'C'): 9, ('A', 'B'): 5}})


|   |   | a | a |
|---|---|---|---|
|   |   | b | a |
| A | B | 1 | 5 |
| A | C | 4 | 9 |
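
In a MultiIndexed frame like this one, the full tuples act as the labels. A brief sketch of selecting a column and a single cell, with illustrative variable names:

```python
import pandas as pd

df_multi = pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 4},
                         ('a', 'a'): {('A', 'C'): 9, ('A', 'B'): 5}})

# Select one column by its full tuple label; the result is a Series
col = df_multi[('a', 'b')]

# Select one cell by row tuple and column tuple
cell = df_multi.loc[('A', 'C'), ('a', 'a')]

print(col)
print(cell)
```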


In [213]:
# In an operation between a DataFrame and a Series, the Series index is aligned on the DataFrame columns, so the broadcasting is row-wise, e.g.:
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df - df.iloc[0]

# Note: older pandas versions broadcast column-wise when the DataFrame index contained dates (time series data); that special case was deprecated, and the explicit form is df.sub(df['A'], axis=0)

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | -0.025307 | -0.963495 | 1.104229 | -1.98695 |
| 2 | -0.77661 | 1.958636 | 0.145169 | -2.968172 |
| 3 | -1.175037 | 0.261131 | 0.037827 | -1.557275 |
| 4 | 0.681258 | 0.840632 | -1.517578 | -1.637086 |
| 5 | -1.470099 | 1.365118 | -1.431201 | -1.748706 |
| 6 | -0.738268 | -0.750468 | -0.690124 | -2.802853 |
| 7 | -0.289313 | 1.003274 | -1.965261 | 0.016136 |
| 8 | -0.857165 | -0.228472 | 0.034651 | -0.997392 |
| 9 | -1.412625 | 0.831527 | -0.924968 | -1.238718 |
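
The two broadcasting directions can be compared on a small deterministic frame. This is a sketch: `df_demo` and its values are made up so the results are easy to verify by hand.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=['A', 'B', 'C', 'D'])

# Row-wise: the first row is subtracted from every row
# (the Series index A..D is aligned on the columns)
row_wise = df_demo - df_demo.iloc[0]

# Column-wise: column A is subtracted from every column
# (axis=0 aligns the Series index on the row index)
col_wise = df_demo.sub(df_demo['A'], axis=0)

print(row_wise)
print(col_wise)
```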
