# Data Prep - Part 1

## DataFrames and Series objects

In [4]:
import numpy as np
import pandas as pd

from pandas import DataFrame, Series

### Selecting and retrieving data

First, let's create a Series object. A Series represents a one-dimensional labelled array. You can specify custom index values; otherwise the index defaults to 0, 1, 2, and so on.

In the examples below we use the `np.arange` function to generate a sequence of numbers.

In [7]:
s1 = Series(np.arange(10))
s1

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [8]:
s2 = Series(np.arange(10), index=['r1','r2','r3','r4','r5','r6','r7','r8','r9','r10'])
s2

r1     0
r2     1
r3     2
r4     3
r5     4
r6     5
r7     6
r8     7
r9     8
r10    9
dtype: int64

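To retrieve data from a Series object, specify the appropriate index value, as shown below. You can use either the custom index label or a conventional zero-based position. The positional lookup is written with `.iloc`, because newer pandas versions deprecate treating integer keys as positions on a label-indexed Series.

In [12]:
print(s2['r4'])
print(s2[['r2', 'r8']])
print(s2.iloc[[2,3]])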

3
r2    1
r8    7
dtype: int64
r3    2
r4    3
dtype: int64
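
As a minimal sketch of the two lookup styles described above, the snippet below rebuilds a Series like `s2` and reads the same element by label (`.loc`) and by position (`.iloc`). The variable names `labels`, `s`, `by_label`, `by_position`, and `several` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Rebuild a Series like s2: values 0..9 with labels r1..r10
labels = [f"r{i}" for i in range(1, 11)]
s = pd.Series(np.arange(10), index=labels)

by_label = s.loc["r4"]      # label-based lookup
by_position = s.iloc[3]     # position-based lookup (zero-based)
several = s.iloc[[2, 3]]    # positions 2 and 3, i.e. labels r3 and r4

print(by_label, by_position)   # 3 3
print(list(several.index))     # ['r3', 'r4']
```

Spelling out `.loc` or `.iloc` keeps the intent unambiguous when the index itself happens to contain integers.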


### Create a DataFrame

In this example we build a DataFrame of 6 rows and 6 columns from 36 random numbers, and specify custom row and column labels.

In [13]:
np.random.seed(25)
df = DataFrame(np.random.rand(36).reshape((6,6)),
    index=['r1','r2','r3','r4','r5','r6'],
    columns=['c1','c2','c3','c4','c5','c6'])
df

          c1        c2        c3        c4        c5        c6
r1  0.870124  0.582277  0.278839  0.185911  0.411100  0.117376
r2  0.684969  0.437611  0.556229  0.367080  0.402366  0.113041
r3  0.447031  0.585445  0.161985  0.520719  0.326051  0.699186
r4  0.366395  0.836375  0.481343  0.516502  0.383048  0.997541
r5  0.514244  0.559053  0.034450  0.719930  0.421004  0.436935
r6  0.281701  0.900274  0.669612  0.456069  0.289804  0.525819


To select data from a specific set of rows and columns, pass the row labels and the column labels to `.loc`.

In [15]:
df.loc[['r1','r3'], ['c1','c3']]

          c1        c3
r1  0.870124  0.278839
r3  0.447031  0.161985
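
The same selection can also be written by position. The sketch below uses a small deterministic frame (named `demo` here purely for illustration, filled with the integers 0 to 35 rather than random numbers) so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

rows = [f"r{i}" for i in range(1, 7)]
cols = [f"c{i}" for i in range(1, 7)]
# Illustrative frame: 0..35 laid out row by row
demo = pd.DataFrame(np.arange(36).reshape(6, 6), index=rows, columns=cols)

by_label = demo.loc[["r1", "r3"], ["c1", "c3"]]   # select by row/column labels
by_position = demo.iloc[[0, 2], [0, 2]]           # same cells by zero-based position
single = demo.at["r3", "c3"]                      # one cell by label

print(by_label.equals(by_position))   # True
print(single)                         # 14
```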


### Data Slicing

Slicing selects a contiguous subset of data from a Series or DataFrame. The start and end index values are separated by a `:`, which indicates that you want all the data between those index positions. When slicing by label, as in the examples below, both the start and the end label are included in the result.
Here is an example on a Series object:

In [17]:
s2['r1':'r4']

r1    0
r2    1
r3    2
r4    3
dtype: int64

And another example using a DataFrame:

In [20]:
df.loc['r2':'r5','c2':'c3']

          c2        c3
r2  0.437611  0.556229
r3  0.585445  0.161985
r4  0.836375  0.481343
r5  0.559053  0.034450
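
One detail worth checking for yourself is how the end of a slice is treated. The sketch below, using an illustrative Series `s` shaped like `s2`, compares a label slice (end label included) with a positional slice (end position excluded, as in ordinary Python slicing).

```python
import numpy as np
import pandas as pd

labels = [f"r{i}" for i in range(1, 11)]
s = pd.Series(np.arange(10), index=labels)

label_slice = s.loc["r1":"r4"]   # label slice: includes r4
pos_slice = s.iloc[0:3]          # positional slice: stops before position 3

print(list(label_slice.index))   # ['r1', 'r2', 'r3', 'r4']
print(list(pos_slice.index))     # ['r1', 'r2', 'r3']
```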


### Comparison operators and Scalar values

You can compare every value in a DataFrame against a scalar using a comparison operator. The result is a DataFrame of the same shape containing True / False values.

In [21]:
df < 0.4

       c1     c2     c3     c4     c5     c6
r1  False  False   True   True  False   True
r2  False  False  False   True  False   True
r3  False  False   True  False   True  False
r4   True  False  False  False   True  False
r5  False  False   True  False  False  False
r6   True  False  False  False   True  False
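
A Boolean frame like this is mostly useful as a mask. As a rough sketch, the example below counts the matching values per column and blanks out the non-matching ones. The small frame `demo` and its values are made up for illustration.

```python
import pandas as pd

# Illustrative 3x2 frame with hand-picked values
demo = pd.DataFrame(
    [[0.1, 0.5], [0.7, 0.2], [0.3, 0.9]],
    index=["r1", "r2", "r3"],
    columns=["c1", "c2"],
)

mask = demo < 0.4                 # True where the value is below 0.4
per_column = mask.sum()           # True counts as 1, so this counts matches per column
total = int(mask.sum().sum())     # matches across the whole frame
kept = demo[mask]                 # non-matching cells become NaN

print(per_column.to_dict())       # {'c1': 2, 'c2': 1}
print(total)                      # 3
```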


### Filtering with Scalars

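The comparison produces a Boolean mask, and indexing with that mask keeps only the entries where it is True. On a Series object it looks like this: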

In [22]:
s2[s2 > 4]

r6     5
r7     6
r8     7
r9     8
r10    9
dtype: int64

### Setting values with Scalars

On a Series object it looks like this. You can set multiple entries to the scalar value by specifying the appropriate index values:

In [27]:
s2[['r1', 'r2']] = 12
s2

r1     12
r2     12
r3      2
r4      3
r5      4
r6      5
r7      6
r8      7
r9      8
r10     9
dtype: int64