# Intro to Data Structures
http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro

In [1]:
#Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Series

**Series** is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the **index**. The basic method to create a Series is to call:

s = pd.Series(data, index=index)

Here, data can be many different things:

- a Python dict
- an ndarray (NumPy’s array class)
- a scalar value

The passed **index** is a list of axis labels. How it is used depends on what **data** is:

#### From ndarray

If data is an ndarray, **index** must be the same length as **data**. If no index is passed, one will be created having values ```[0, ..., len(data) - 1]```.

In [3]:
s = pd.Series(np.random.randn(5), index = ['a', 'b', 'c', 'd', 'e'])
s

a    0.396318
b    1.126408
c   -1.063095
d    1.871038
e   -0.442764
dtype: float64

In [4]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [5]:
pd.Series(np.random.randn(5))

0   -1.268491
1    1.756098
2    0.277826
3   -0.750573
4   -0.656236
dtype: float64

<div class="alert alert-block alert-info">**Note** pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time. The reason for being lazy is nearly all performance-based (there are many instances in computations, like parts of GroupBy, where the index is not used).</div>
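As a minimal sketch of the lazy behavior described in the note above, the following builds a Series with a repeated label. The name `dup` is only illustrative, and the example assumes that `reindex` is one of the operations that rejects duplicate labels.

```python
import pandas as pd

# A Series whose index repeats the label 'a' is created without complaint.
dup = pd.Series([1.0, 2.0, 3.0], index=['a', 'a', 'b'])
print(dup.index.is_unique)   # False

# Label lookup still works; a repeated label returns every matching row.
both_a = dup['a']
print(len(both_a))           # 2

# The error only appears once an operation needs unique labels.
try:
    dup.reindex(['a', 'b'])
    reindex_failed = False
except ValueError:
    reindex_failed = True
print(reindex_failed)        # True
```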

#### From dict

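If data is a dict and an **index** is passed, the values in data corresponding to the labels in the index will be pulled out. Otherwise, an index will be constructed from the keys of the dict: sorted, if possible, in the pandas version used here, whereas pandas 0.23 and later keep the dict's insertion order on Python 3.6+.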

In [6]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}
pd.Series(d)

a    0.0
b    1.0
c    2.0
dtype: float64

In [7]:
pd.Series(d, index=['b', 'c', 'd', 'a'])

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

<div class="alert alert-block alert-info">**Note:** NaN (not a number) is the standard missing data marker used in pandas</div>

#### From scalar value

If data is a scalar value, an index must be provided. The value will be repeated to match the length of **index**.

In [8]:
pd.Series(5, index = ['a', 'b', 'c', 'd', 'e'])

a    5
b    5
c    5
d    5
e    5
dtype: int64

### Series is ndarray-like

Series acts very similarly to an *ndarray*, and is a valid argument to most NumPy functions. However, things like slicing also slice the index.

In [9]:
s # s, as created above with randn

a    0.396318
b    1.126408
c   -1.063095
d    1.871038
e   -0.442764
dtype: float64

In [10]:
s[0]

0.39631807684760217

In [11]:
s[:3]

a    0.396318
b    1.126408
c   -1.063095
dtype: float64

In [12]:
s.median()

0.3963180768476022

In [13]:
s[s > s.median()]

b    1.126408
d    1.871038
dtype: float64

In [14]:
s[[4,3,1]]

e   -0.442764
d    1.871038
b    1.126408
dtype: float64

In [15]:
np.exp(s) # e (Euler's number, approximately 2.718281) raised to the power of each element of s

a    1.486342
b    3.084557
c    0.345385
d    6.495037
e    0.642259
dtype: float64

### Series is dict-like
A Series is like a fixed-size dict in that you can get and set values by index label:

In [16]:
s['a']

0.39631807684760217

In [17]:
s['e']

-0.44276388116105242

In [18]:
s

a    0.396318
b    1.126408
c   -1.063095
d    1.871038
e   -0.442764
dtype: float64

In [19]:
'e' in s

True

In [20]:
'f' in s

False

If a label is not contained, an exception is raised:

In [21]:
# s['f']
# raises KeyError: 'f'

With the get method, a missing label returns None or a specified default:

In [22]:
s.get('f')

In [23]:
s.get('f', np.nan)

nan

See also "Indexing and Selecting Data" - "Attribute Access"
- http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-attribute-access
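A minimal sketch of the attribute access mentioned above, assuming the index labels are valid Python identifiers that do not clash with existing Series methods. The name `sa` is only illustrative.

```python
import pandas as pd

sa = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

# A label that is a valid Python identifier can be read as an attribute.
print(sa.b)             # 2.0
print(sa.b == sa['b'])  # True
```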

### Vectorized operations and label alignment with Series
When doing data analysis, as with raw NumPy arrays, looping through a Series value by value is usually not necessary. Series can also be passed into most NumPy methods expecting an ndarray.

In [24]:
s + s

a    0.792636
b    2.252816
c   -2.126190
d    3.742077
e   -0.885528
dtype: float64

In [25]:
s * 2

a    0.792636
b    2.252816
c   -2.126190
d    3.742077
e   -0.885528
dtype: float64

In [26]:
np.exp(s)

a    1.486342
b    3.084557
c    0.345385
d    6.495037
e    0.642259
dtype: float64

A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

In [27]:
s[1:] + s[:-1]

a         NaN
b    2.252816
c   -2.126190
d    3.742077
e         NaN
dtype: float64

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing (NaN). Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.

<div class="alert alert-block alert-info">**Note** In general, we chose to make the default result of operations between differently indexed objects yield the **union** of the indexes in order to avoid loss of information. Having an index label, though the data is missing, is typically important information as part of a computation. You of course have the option of dropping labels with missing data via the **dropna** function.</div>
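The `dropna` option mentioned in the note can be sketched as follows. The Series `t` and its values are made up for illustration; only the alignment and `dropna` behavior are the point.

```python
import pandas as pd

t = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# Labels 'a' and 'd' each appear in only one operand, so they become NaN.
total = t[1:] + t[:-1]
print(int(total.isna().sum()))  # 2

# dropna() keeps only the labels that were present in both operands.
clean = total.dropna()
print(list(clean.index))        # ['b', 'c']
print(clean.tolist())           # [4.0, 6.0]
```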

###  Name attribute

Series can also have a name attribute:

In [30]:
s = pd.Series(np.random.randn(5), name = 'something')
s

0    0.674542
1   -0.751795
2   -2.680018
3    0.478470
4    2.404914
Name: something, dtype: float64

In [33]:
s.name

'something'

In [35]:
s2 = s.rename("different")
s2.name

'different'

Note that s and s2 refer to different objects.
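A short check of that statement, using illustrative names `orig` and `renamed`:

```python
import pandas as pd

orig = pd.Series([1, 2, 3], name='something')
renamed = orig.rename('different')

# rename returned a new Series; the original keeps its own name.
print(renamed is orig)  # False
print(orig.name)        # something
print(renamed.name)     # different
```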

## DataFrame

**DataFrame** is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame

Along with the data, you can optionally pass **index** (row labels) and **columns** (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.

> http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame

**DataFrame.all(axis=None, bool_only=None, skipna=None, level=None, \*\*kwargs)**
>Return whether all elements are True over requested axis

**Parameters:**
>**axis** : {index (0), columns (1)}

>**skipna** : boolean, default True

>>Exclude NA/null values. If an entire row/column is NA, the result will be NA

>**level** : int or level name, default None
>>If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

>**bool_only** : boolean, default None

>>Include only boolean data. If None, will attempt to use everything, then use only boolean data

**Returns:** all : Series or DataFrame (if level specified)
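The following is a small sketch of `DataFrame.all` on made-up boolean data. It uses only the `axis` parameter, since the other parameters listed above may not be accepted by every pandas release.

```python
import pandas as pd

flags = pd.DataFrame({'x': [True, True, False],
                      'y': [True, True, True]})

# Default axis (index): one result per column.
per_column = flags.all()
print(per_column.tolist())  # [False, True]

# axis=1 (columns): one result per row.
per_row = flags.all(axis=1)
print(per_row.tolist())     # [True, True, False]
```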

###  From dict of Series or dict

The result **index** will be the **union** of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the sorted list of dict keys (pandas 0.23 and later on Python 3.6+ keep the dict's insertion order instead).

In [36]:
d = {'one' : pd.Series([1., 2., 3.,], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [37]:
pd.DataFrame(d, index=['d', 'b', 'a'])

   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0


In [38]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN


The row and column labels can be accessed respectively by accessing the **index** and **columns** attributes:

<div class="alert alert-block alert-info">**Note** When a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict.</div>

In [39]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [40]:
df.columns

Index(['one', 'two'], dtype='object')
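The nested-dict case described at the start of this section can be sketched in the same way. The dict `nested` is made up for illustration; each inner dict plays the role of a Series.

```python
import pandas as pd

nested = {'one': {'a': 1.0, 'b': 2.0},
          'two': {'a': 1.0, 'b': 2.0, 'c': 3.0}}

# Each inner dict is treated like a Series keyed by its own labels.
df_nested = pd.DataFrame(nested)

# The row index is the union of the inner keys; 'one' has no value for 'c'.
print(sorted(df_nested.index))             # ['a', 'b', 'c']
print(pd.isna(df_nested.loc['c', 'one']))  # True
```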

###  From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must clearly also be the same length as the arrays. If no index is passed, the result will be range(n), where n is the array length.

In [41]:
d = {'one' : [1., 2., 3., 4.],
     'two' : [4., 3., 2., 1] }

pd.DataFrame(d)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


In [42]:
pd.DataFrame(d, index=['a','b','c','d'])

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
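To illustrate the same-length requirement stated at the top of this section, here is a sketch of what happens when the lists differ in length. The flag `mismatch_rejected` is an illustrative name, and the exact error message may vary between pandas versions.

```python
import pandas as pd

# 'one' has three values but 'two' has only two, so construction fails.
try:
    pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 3.0]})
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print(err)
```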


### From structured or record array

This case is handled identically to a dict of arrays.

In [43]:
data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])  # 'S10': 10-byte string field

data

array([(0,  0., b''), (0,  0., b'')], 
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [45]:
data[:] = [(1,2.,'Hello'),(2,3.,'World')]

pd.DataFrame(data)

   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'


In [46]:
pd.DataFrame(data, index=['first', 'second'])

        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'


In [47]:
pd.DataFrame(data, columns=['C','A','B'])

          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0


<div class="alert alert-block alert-info">**Note** DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.</div>
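One concrete difference behind this note, sketched with made-up data: plain `[]` indexing selects a row on a 2-D ndarray but a column on a DataFrame. The names `arr` and `frame` are illustrative.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)
frame = pd.DataFrame(arr, columns=['x', 'y'])

# On an ndarray, arr[0] is the first row.
print(arr[0].tolist())         # [0, 1]

# On a DataFrame, frame['x'] is a column; rows are reached with .iloc or .loc.
print(frame['x'].tolist())     # [0, 2, 4]
print(frame.iloc[0].tolist())  # [0, 1]
```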

### From a list of dicts

In [48]:
data2 = [{'a' : 1, 'b' : 2},{'a' : 5, 'b' :10, 'c' : 20}]

pd.DataFrame(data2)

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [49]:
pd.DataFrame(data2, index = ['first', 'second'])

        a   b     c
first   1   2   NaN
second  5  10  20.0


In [50]:
pd.DataFrame(data2, columns = ['a', 'b'])

   a   b
0  1   2
1  5  10


### From a dict of tuples

In [51]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
              ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
              ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
              ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
              ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

       a              b
       a    b    c    a     b
A B  4.0  1.0  5.0  8.0  10.0
  C  3.0  2.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0


### From a Series

The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name is provided).
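A minimal sketch of this case, using an illustrative Series called `named`. The `Series.to_frame(name=...)` call is shown here as one way to supply a different column name; it is not taken from the text above.

```python
import pandas as pd

named = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'], name='something')

# The Series name becomes the single column label; the index is kept.
df_from_series = pd.DataFrame(named)
print(list(df_from_series.columns))  # ['something']
print(list(df_from_series.index))    # ['a', 'b', 'c']

# to_frame lets you pick another column name explicitly.
df_other = named.to_frame(name='other')
print(list(df_other.columns))        # ['other']
```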

#### Missing Data

Much more will be said on this topic in the __[Missing data section](http://pandas.pydata.org/pandas-docs/stable/missing_data.html#missing-data)__. To construct a DataFrame with missing data, use np.nan for those values which are missing. Alternatively, you may pass a numpy.MaskedArray as the data argument to the DataFrame constructor, and its masked entries will be considered missing.
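Both approaches described above can be sketched briefly. The data and the names `df_nan`, `masked`, and `df_masked` are made up for illustration.

```python
import numpy as np
import numpy.ma as ma
import pandas as pd

# Missing values written explicitly as np.nan.
df_nan = pd.DataFrame({'one': [1.0, np.nan, 3.0]})
print(df_nan['one'].isna().tolist())     # [False, True, False]

# A masked array: the masked entry (row 0, column 1) is treated as missing.
masked = ma.masked_array([[1.0, 2.0], [3.0, 4.0]],
                         mask=[[False, True], [False, False]])
df_masked = pd.DataFrame(masked, columns=['x', 'y'])
print(df_masked.isna().values.tolist())  # [[False, True], [False, False]]
```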

### Alternate Constructors