# Week 4: Introduction to Pandas

From this stage on, Pandas will be the main library we use to study our data and apply machine learning algorithms. Pandas inherits significant parts of NumPy's idiomatic style of array-based computing. Unlike a NumPy array, which holds a single data type, a Pandas table can hold heterogeneous data, with a different type in each column.
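To make the point about heterogeneous data concrete, here is a minimal sketch. The sample values and the names `arr` and `df` are made up for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype, so mixing strings and numbers
# coerces every element to a string.
arr = np.array(["New York", 2000, 10.5])
print(arr.dtype)   # a Unicode string dtype

# A DataFrame keeps one dtype per column, so each column keeps its own type.
df = pd.DataFrame({"state": ["New York", "Alabama"],
                   "year": [2000, 2001],
                   "rate": [10.5, 3.2]})
print(df.dtypes)
```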

In [2]:
import numpy as np
import pandas as pd

Pandas is built around two main data structures: __Series__ and __DataFrames__. A __Series__ is a one-dimensional array-like object containing a sequence of values together with an associated array of labels, known as its index.

In [3]:
s_1 = pd.Series([3, 4, 67, 1])

In [4]:
s_1

0     3
1     4
2    67
3     1
dtype: int64

The values appear on the right and the index on the left (0 to 3). Since we did not specify an index for the data, pandas automatically generates a default one consisting of the integers 0 to N-1, where N is the number of values.

In [5]:
s_1.values

array([ 3,  4, 67,  1])

In [6]:
(s_1.index)

RangeIndex(start=0, stop=4, step=1)

In [7]:
type(s_1.values)

numpy.ndarray

In some cases you would want to create a Series with a specific index identifying each data point:

In [8]:
s_2 = pd.Series([0,1,4,2,5],index=['blue','black','yellow','red','white'])

In [9]:
s_2

blue      0
black     1
yellow    4
red       2
white     5
dtype: int64

In [10]:
s_2.index

Index(['blue', 'black', 'yellow', 'red', 'white'], dtype='object')

Unlike with a NumPy array, we can use the labels in the index when selecting (indexing) values:

In [11]:
s_2['white']

5

In [12]:
s_2[['red','yellow']]

red       2
yellow    4
dtype: int64

Fancy indexing (a list of labels, as above) and boolean indexing work in pandas as they do in NumPy:

In [13]:
s_2[s_2==1]

black    1
dtype: int64
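Boolean indexing is easier to follow when the mask is built as a separate step. The sketch below recreates `s_2` from above; the name `mask` and the chosen conditions are only illustrative.

```python
import pandas as pd

s_2 = pd.Series([0, 1, 4, 2, 5],
                index=["blue", "black", "yellow", "red", "white"])

# The comparison produces a boolean Series with the same index...
mask = s_2 > 1
print(mask)

# ...and indexing with it keeps only the entries where the mask is True.
print(s_2[mask])

# Conditions are combined with & (and) and | (or); each one needs parentheses.
print(s_2[(s_2 > 1) & (s_2 < 5)])
```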

Scalar multiplication and math functions work as well, and they preserve the index:

In [14]:
s_2 *3

blue       0
black      3
yellow    12
red        6
white     15
dtype: int64

In [15]:
np.exp(s_2)

blue        1.000000
black       2.718282
yellow     54.598150
red         7.389056
white     148.413159
dtype: float64

A Series can also be used much like a dictionary that maps index labels to values:

In [16]:
'red' in s_2

True
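One detail of the dictionary analogy is worth spelling out: like `in` on a dict, `in` on a Series tests the index labels, not the values. A short sketch, again recreating `s_2` (the label `'green'` is a made-up example of a missing key):

```python
import pandas as pd

s_2 = pd.Series([0, 1, 4, 2, 5],
                index=["blue", "black", "yellow", "red", "white"])

print("red" in s_2)          # True: 'red' is an index label
print(4 in s_2)              # False: 4 is a value, not a label
print(4 in s_2.values)       # True: search the underlying values instead
print(s_2.get("green", -1))  # -1: dict-style lookup with a default
```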

This also means that we can easily convert a dictionary into a Series:

In [17]:
my_dict = {'Tony':36,'James':42,'Rebecca':26}

In [18]:
my_s = pd.Series(my_dict)

In [19]:
my_s

Tony       36
James      42
Rebecca    26
dtype: int64

By default the dict keys become the index. To override this, pass the keys as the index argument in the order you want them to appear:

In [20]:
names = ['Sarah','Rebecca','James']

In [21]:
my_s2 = pd.Series(my_dict, index=names)

In [22]:
my_s2

Sarah       NaN
Rebecca    26.0
James      42.0
dtype: float64

Notice that the two values found in my_dict were placed in the appropriate locations, but since no value for 'Sarah' was found, it appears as NaN (not a number), which pandas uses to mark missing or NA values. Since 'Tony' was not included in the list of names, it does not appear in the resulting Series. The isnull and notnull functions in pandas should be used to detect missing data:



In [23]:
pd.isnull(my_s2)

Sarah       True
Rebecca    False
James      False
dtype: bool

In [24]:
pd.notnull(my_s2)

Sarah      False
Rebecca     True
James       True
dtype: bool

Both are also available as methods on the Series itself:

In [25]:
my_s2.isnull()

Sarah       True
Rebecca    False
James      False
dtype: bool
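The boolean masks returned by these functions are usually a means to an end. As a small sketch (the name `n_missing` is introduced here only for illustration), they can be summed to count missing entries or used to filter them out:

```python
import pandas as pd

my_dict = {"Tony": 36, "James": 42, "Rebecca": 26}
my_s2 = pd.Series(my_dict, index=["Sarah", "Rebecca", "James"])

# The integers became floats because NaN is a floating-point value.
print(my_s2.dtype)

# Summing a boolean mask counts its True entries, i.e. the missing values.
n_missing = my_s2.isnull().sum()
print(n_missing)

# The notnull mask keeps only the entries that have a value.
print(my_s2[my_s2.notnull()])
```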

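What happens when we apply arithmetic operations to two Series? The values are aligned by index label, so a label that is missing from either Series produces NaN: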

In [26]:
my_s + my_s2

James      84.0
Rebecca    52.0
Sarah       NaN
Tony        NaN
dtype: float64
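If NaN is not the desired result for labels present in only one Series, the `add` method accepts a `fill_value`. The sketch below rebuilds the two Series and treats a missing entry as 0; choosing 0 is an assumption made for illustration, not a general recommendation.

```python
import pandas as pd

my_dict = {"Tony": 36, "James": 42, "Rebecca": 26}
my_s = pd.Series(my_dict)
my_s2 = pd.Series(my_dict, index=["Sarah", "Rebecca", "James"])

# The result index is the union of both indexes. An entry missing on
# one side is replaced by fill_value before the addition.
total = my_s.add(my_s2, fill_value=0)
print(total)
```

'Tony' now keeps its value from `my_s`, while 'Sarah' stays NaN because it has no value in either Series.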

Both the Series itself and its index have a name attribute that can be assigned:

In [27]:
my_s.name='Age'

In [28]:
my_s.index.name='Employee'

In [29]:
my_s

Employee
Tony       36
James      42
Rebecca    26
Name: Age, dtype: int64

A Series index can be altered in place by assignment, provided the new index has the same length as the Series:

In [30]:
my_s.index=['Simon','Lara','Mary']

In [31]:
my_s

Simon    36
Lara     42
Mary     26
Name: Age, dtype: int64
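Two caveats about replacing an index, shown as a hedged sketch: the new index must have one label per value, and `rename` is one way to relabel without touching the original. The replacement labels here are arbitrary.

```python
import pandas as pd

my_s = pd.Series({"Tony": 36, "James": 42, "Rebecca": 26}, name="Age")

# Assigning an index of the wrong length raises an error.
try:
    my_s.index = ["Simon", "Lara"]   # only two labels for three values
except ValueError as err:
    print("ValueError:", err)

# rename returns a new Series with the mapped labels; my_s is unchanged.
renamed = my_s.rename({"Tony": "Simon"})
print(renamed.index.tolist())
print(my_s.index.tolist())
```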


## DataFrame

A __DataFrame__ is a table of data containing an ordered, indexed set of columns and rows. It can be seen as a collection of Series sharing the same index, or as a two-dimensional Series. As with a Series, there are many ways to create a __DataFrame__.

In [33]:
data={'state':['New York','New York','Alabama','Alabama','Alabama','Alabama'],'year':[2000,2001,2001,2002,2003,2004],'rank':[10,12,3,11,5,8]}

In [34]:
data

{'state': ['New York', 'New York', 'Alabama', 'Alabama', 'Alabama', 'Alabama'],
 'year': [2000, 2001, 2001, 2002, 2003, 2004],
 'rank': [10, 12, 3, 11, 5, 8]}

In [35]:
frame = pd.DataFrame(data)

In [36]:
frame

      state  year  rank
0  New York  2000    10
1  New York  2001    12
2   Alabama  2001     3
3   Alabama  2002    11
4   Alabama  2003     5
5   Alabama  2004     8


In the absence of an explicit index, pandas automatically assigns a range index, as we saw with Series.
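A dict of lists is only one possible input. As a sketch of two other common ones, the example below builds a DataFrame from a list of row dicts and from a dict of Series; the variable names are illustrative.

```python
import pandas as pd

# A list of dicts: each dict is one row, and the keys become the columns.
records = [
    {"state": "New York", "year": 2000, "rank": 10},
    {"state": "New York", "year": 2001, "rank": 12},
    {"state": "Alabama", "year": 2001, "rank": 3},
]
frame_from_records = pd.DataFrame(records)
print(frame_from_records)

# A dict of Series: columns are aligned on the union of their indexes,
# so 'year' has no value for row 'c' and shows NaN there.
frame_from_series = pd.DataFrame({
    "year": pd.Series([2000, 2001], index=["a", "b"]),
    "rank": pd.Series([10, 12, 3], index=["a", "b", "c"]),
})
print(frame_from_series)
```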

In [37]:
frame.head() #selects and shows the first five rows

      state  year  rank
0  New York  2000    10
1  New York  2001    12
2   Alabama  2001     3
3   Alabama  2002    11
4   Alabama  2003     5


In [38]:
pd.DataFrame(data,columns=['year','state','rank']) # the dataframe created with columns in that order

   year     state  rank
0  2000  New York    10
1  2001  New York    12
2  2001   Alabama     3
3  2002   Alabama    11
4  2003   Alabama     5
5  2004   Alabama     8
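The `columns` argument selects as well as orders. In the sketch below, `'debt'` is a hypothetical column name that is not a key of `data`: it is created and filled with missing values, while `'rank'` is left out because it is not listed.

```python
import pandas as pd

data = {"state": ["New York", "New York", "Alabama", "Alabama", "Alabama", "Alabama"],
        "year": [2000, 2001, 2001, 2002, 2003, 2004],
        "rank": [10, 12, 3, 11, 5, 8]}

# Only the listed columns are kept; a name absent from data yields a
# column of missing values.
frame_3 = pd.DataFrame(data, columns=["year", "state", "debt"])
print(frame_3)
```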


The index can be specified in the same way:

In [39]:
frame_2 = pd.DataFrame(data,columns=['year','state','rank'],index=['a','b','c','d','e','f'])

In [40]:
frame_2

   year     state  rank
a  2000  New York    10
b  2001  New York    12
c  2001   Alabama     3
d  2002   Alabama    11
e  2003   Alabama     5
f  2004   Alabama     8


In [41]:
frame_2.columns

Index(['year', 'state', 'rank'], dtype='object')

In [42]:
frame_2['state']

a    New York
b    New York
c     Alabama
d     Alabama
e     Alabama
f     Alabama
Name: state, dtype: object

In [43]:
frame_2.state

a    New York
b    New York
c     Alabama
d     Alabama
e     Alabama
f     Alabama
Name: state, dtype: object

In [44]:
type(frame_2.state)

pandas.core.series.Series
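Attribute access is convenient, but it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. The `rank` column in this very dataset is a case in point, since `rank` is also a DataFrame method. A small sketch with a shortened version of the data:

```python
import pandas as pd

frame = pd.DataFrame({"state": ["New York", "Alabama"],
                      "year": [2000, 2001],
                      "rank": [10, 3]})

# 'state' is not a DataFrame attribute, so dot access falls back to the column.
print(type(frame.state))

# 'rank' is also the name of a DataFrame method, so dot access returns
# the method, not the column.
print(callable(frame.rank))      # True

# Bracket access always refers to the column.
print(frame["rank"].tolist())    # [10, 3]
```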

In [45]:
frame_2.loc['b'] # To retrieve rows

year         2001
state    New York
rank           12
Name: b, dtype: object
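`loc` selects by label and also accepts a column selector as a second argument. Its positional counterpart `iloc`, which is not used elsewhere in these notes, selects by integer position. A short sketch on the same `frame_2`:

```python
import pandas as pd

data = {"state": ["New York", "New York", "Alabama", "Alabama", "Alabama", "Alabama"],
        "year": [2000, 2001, 2001, 2002, 2003, 2004],
        "rank": [10, 12, 3, 11, 5, 8]}
frame_2 = pd.DataFrame(data, columns=["year", "state", "rank"],
                       index=["a", "b", "c", "d", "e", "f"])

# A single cell: row label first, then column label.
print(frame_2.loc["b", "state"])    # New York

# The same row as frame_2.loc['b'], selected by position instead of label.
print(frame_2.iloc[1])

# Lists of labels select a sub-table.
subset = frame_2.loc[["a", "c"], ["year", "rank"]]
print(subset)
```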

A new column can be created by simple assignment. A scalar value is broadcast to every row:

In [46]:
frame_2['rate']=9.0

In [47]:
frame_2

   year     state  rank  rate
a  2000  New York    10   9.0
b  2001  New York    12   9.0
c  2001   Alabama     3   9.0
d  2002   Alabama    11   9.0
e  2003   Alabama     5   9.0
f  2004   Alabama     8   9.0


A column can also be assigned a sequence whose length matches the number of rows:

In [48]:
frame_2['rate']=range(6)

In [49]:
frame_2

   year     state  rank  rate
a  2000  New York    10     0
b  2001  New York    12     1
c  2001   Alabama     3     2
d  2002   Alabama    11     3
e  2003   Alabama     5     4
f  2004   Alabama     8     5


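When a Series is assigned to a column, its labels are aligned with the DataFrame's index, so the Series may be shorter than the DataFrame. Rows whose label is missing from the Series are filled with NaN: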

In [51]:
val = pd.Series([-1.2, -1.5, -1.7], index=['b', 'd', 'e'])
frame_2['rate'] = val
frame_2

   year     state  rank  rate
a  2000  New York    10   NaN
b  2001  New York    12  -1.2
c  2001   Alabama     3   NaN
d  2002   Alabama    11  -1.5
e  2003   Alabama     5  -1.7
f  2004   Alabama     8   NaN
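The difference between assigning a plain list and assigning a Series can be seen in a small sketch. The three-row `frame` and the label `'z'` are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2001]}, index=["a", "b", "c"])

# A list carries no labels, so its length must match the number of rows.
try:
    frame["rate"] = [-1.2, -1.5]     # two values for three rows
except ValueError as err:
    print("ValueError:", err)

# A Series is matched by label: its order does not matter, labels absent
# from the frame ('z') are ignored, and rows without a match ('b') get NaN.
frame["rate"] = pd.Series([-1.5, -1.2, 9.9], index=["c", "a", "z"])
print(frame)
```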


Assigning to a column that doesn't exist creates a new column. The __del__ keyword deletes columns, as with a dict. We first add a new column and then illustrate how to delete it:

In [55]:
frame_2['eastern'] = frame_2.state == 'Alabama'
frame_2

   year     state  rank  rate  eastern
a  2000  New York    10   NaN    False
b  2001  New York    12  -1.2    False
c  2001   Alabama     3   NaN     True
d  2002   Alabama    11  -1.5     True
e  2003   Alabama     5  -1.7     True
f  2004   Alabama     8   NaN     True


Above we created a boolean column that is True where the state equals Alabama.

In [57]:
del frame_2['eastern']
frame_2.columns

Index(['year', 'state', 'rank', 'rate'], dtype='object')
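`del` modifies the DataFrame in place. When the original should be kept, `drop` is one alternative, since it returns a new DataFrame. A sketch on a shortened frame; the name `trimmed` is illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"state": ["New York", "Alabama"], "year": [2000, 2001]})
frame["eastern"] = frame["state"] == "Alabama"

# drop returns a new DataFrame and leaves the original unchanged.
trimmed = frame.drop(columns=["eastern"])
print(trimmed.columns.tolist())   # ['state', 'year']
print(frame.columns.tolist())     # ['state', 'year', 'eastern']

# del removes the column from the original object.
del frame["eastern"]
print(frame.columns.tolist())     # ['state', 'year']
```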

![DataFrame](Dataframe.jpeg)