# Getting Started with pandas

- __pandas contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python__

- __While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data__

In [9]:
# Let's import pandas (it must already be installed)
import pandas as pd

In [10]:
from pandas import DataFrame, Series

### Introduction to pandas Data Structures
- __pandas has two main data structures, Series and DataFrame. A third, Panel, existed in older releases but was removed in pandas 0.25__

### Series

- __A Series is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy types) and an associated array of data labels, called its index__
- __The simplest series is formed from only an array of data__

##### Difference between a pandas 'Series' and a 'NumPy array'
- __The main difference is the presence of an explicit index__
- __A NumPy array has an implicitly defined integer index used to access values__
- __A pandas Series has an explicitly defined index associated with the values__
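
- __A minimal sketch of that difference, using made-up values; the names 'arr' and 'ser' are illustrative and not part of the notebook:__

```python
import numpy as np
import pandas as pd

# A NumPy array is addressed only by its implicit integer position.
arr = np.array([10, 20, 30])
first_from_array = arr[0]

# A Series carries an explicit index, so values can be looked up by label.
ser = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
first_from_series = ser['x']

# Position-based access on a Series is still available through .iloc.
same_value = ser.iloc[0]

print(first_from_array, first_from_series, same_value)
```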



In [11]:
# Example :-

obj = pd.Series([3,-4,5,7])
obj

0    3
1   -4
2    5
3    7
dtype: int64

- __The string representation of a Series displayed interactively shows the 'index on the left' and the 'values on the right'__

In [12]:
# object values

obj.values

array([ 3, -4,  5,  7])

In [13]:
#index

obj.index

RangeIndex(start=0, stop=4, step=1)

In [14]:
# we can also supply our own index labels, for example:

obj2 = Series([1,2,3,4,5],index=('a','b','c','d','e'))
obj2

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [17]:
# access a single value by label

obj2['a']

1

In [18]:
# access multiple values with a list of labels

obj2[['a','b','e']]

a    1
b    2
e    5
dtype: int64

- __Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link__

In [20]:
import numpy as np

np.exp(obj2)

a      2.718282
b      7.389056
c     20.085537
d     54.598150
e    148.413159
dtype: float64

In [24]:
# select values based on a condition

obj2[obj2>2]

c    3
d    4
e    5
dtype: int64
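
- __Scalar multiplication, mentioned above but not shown, keeps the labels in the same way. A minimal sketch that rebuilds 'obj2' so it runs on its own ('doubled' is an illustrative name):__

```python
import pandas as pd

obj2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Multiplying by a scalar is applied element-wise; the labels stay attached.
doubled = obj2 * 2

print(doubled)
```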

- __Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict__

In [25]:
'e' in obj2

True

In [26]:
'f' in obj2

False
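
- __The dict analogy goes a little further than membership tests. A small sketch, assuming the same 'obj2' as above; 'present', 'missing' and 'as_dict' are illustrative names:__

```python
import pandas as pd

obj2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# .get returns a default instead of raising KeyError for a missing label.
present = obj2.get('e', 0)
missing = obj2.get('f', 0)

# .to_dict converts the Series back into a plain Python dict.
as_dict = obj2.to_dict()

print(present, missing, as_dict)
```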

- __We can create a Series directly from a dict; the dict keys become the index__

In [27]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

to_series = pd.Series(sdata)
to_series

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

- __When a dict is passed together with an explicit index, as in the next cell, values are placed by label. Three values found in 'sdata' are placed in the appropriate locations, but since no value for 'California' is found, it appears as NaN (not a number), which pandas uses to mark missing or NA values__
- __Since 'Utah' is not included in 'states', it is excluded from the resulting object__

In [29]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

a_series = pd.Series(sdata, index = states)
a_series

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [36]:
# Let's check null values

pd.isnull(a_series)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [37]:
a_series.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
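
- __The complementary checks can be sketched the same way. This assumes the 'sdata' and 'states' values used above; 'present_mask' and 'without_missing' are illustrative names:__

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
a_series = pd.Series(sdata, index=states)

# notnull is the inverse of isnull: True where a value is present.
present_mask = a_series.notnull()

# dropna returns a new Series without the missing entries.
without_missing = a_series.dropna()

print(present_mask)
print(without_missing)
```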

- __Let's add two series__

In [32]:
to_series

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [33]:
a_series

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [34]:
b_series = to_series + a_series
b_series

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64


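- __Arithmetic between two Series aligns them by index label. If you have experience with databases, you can think of this as being similar to a 'join operation'__
- __In the result above, 'California' is NaN because it is missing from 'to_series' and is already NaN in 'a_series', and 'Utah' is NaN because it is missing from 'a_series'. A label needs a value in both operands to produce a non-missing sum__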
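
- __If a one-sided label should count as zero instead of producing NaN, the 'add' method with 'fill_value' is one option. A hedged sketch using the same data; 'filled' is an illustrative name:__

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
to_series = pd.Series(sdata)
a_series = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# fill_value=0 stands in for a value that is missing on one side only.
# A label that is missing on both sides still yields NaN.
filled = to_series.add(a_series, fill_value=0)

print(filled)
```

- __Here 'Utah' keeps its value from 'to_series', while 'California' stays NaN because neither Series has a value for it__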



- __A Series’s index can be altered in-place by assignment__

In [38]:
obj

0    3
1   -4
2    5
3    7
dtype: int64

In [39]:
obj.index = ['a','b','c','d']
obj

a    3
b   -4
c    5
d    7
dtype: int64
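
- __When the original labels should be kept, 'rename' is an alternative that returns a new Series instead of modifying the existing one. A minimal sketch; 'renamed' is an illustrative name:__

```python
import pandas as pd

obj = pd.Series([3, -4, 5, 7])

# rename with a mapping relabels only the listed entries and returns a copy.
renamed = obj.rename({0: 'a', 1: 'b'})

print(renamed)
print(obj.index)  # the original index is unchanged
```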

### DataFrame

- __A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, etc.)__
- __A DataFrame has both a row index and a column index__
- __Internally, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays__

- __There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays__

In [41]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
       }

In [42]:
a_frame = pd.DataFrame(data)
a_frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
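
- __Each column keeps its own value type, which can be checked with 'dtypes'. A small sketch that rebuilds the same frame; 'column_types' is an illustrative name:__

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
a_frame = pd.DataFrame(data)

# dtypes returns a Series mapping each column name to its dtype.
column_types = a_frame.dtypes

print(column_types)
```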


- __For large DataFrames, the head method selects only the first five rows by default:__

In [43]:
a_frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
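
- __'head' also accepts a row count, and 'tail' does the same from the end of the frame. A minimal sketch on the same data; 'first_two' and 'last_two' are illustrative names:__

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
a_frame = pd.DataFrame(data)

first_two = a_frame.head(2)  # first two rows
last_two = a_frame.tail(2)   # last two rows

print(first_two)
print(last_two)
```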


- __If you specify a sequence of columns, the DataFrame's columns will be arranged in that order__

In [47]:
pd.DataFrame(data, columns=['year','state','pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


- __Let's pass a column name that is not contained in the data__

In [49]:
pd.DataFrame(data, columns=['year','state','pop','debt'])

   year   state  pop debt
0  2000    Ohio  1.5  NaN
1  2001    Ohio  1.7  NaN
2  2002    Ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2002  Nevada  2.9  NaN
5  2003  Nevada  3.2  NaN


- __A column that is not contained in the data appears filled with NaN__
- __Now let's pass an index too__

In [56]:
b_frame = pd.DataFrame(data, columns=['year','state','pop','debt'],index=['a','b','c','d','e','f'])
b_frame

   year   state  pop debt
a  2000    Ohio  1.5  NaN
b  2001    Ohio  1.7  NaN
c  2002    Ohio  3.6  NaN
d  2001  Nevada  2.4  NaN
e  2002  Nevada  2.9  NaN
f  2003  Nevada  3.2  NaN


In [57]:
# Retrieve a column as a Series, either by attribute or by dict-like notation
b_frame.year

a    2000
b    2001
c    2002
d    2001
e    2002
f    2003
Name: year, dtype: int64

In [58]:
b_frame['year']

a    2000
b    2001
c    2002
d    2001
e    2002
f    2003
Name: year, dtype: int64

In [63]:
b_frame.loc['c']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: c, dtype: object
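
- __'loc' selects rows by label; its counterpart 'iloc' selects by integer position. A hedged sketch on a frame built like 'b_frame' but without the 'debt' column; 'by_label', 'by_position' and 'single_value' are illustrative names:__

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
b_frame = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                       index=['a', 'b', 'c', 'd', 'e', 'f'])

by_label = b_frame.loc['c']            # row whose label is 'c'
by_position = b_frame.iloc[2]          # third row, counting from zero
single_value = b_frame.loc['c', 'pop'] # one cell: row 'c', column 'pop'

print(by_label)
print(single_value)
```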

- __Columns can be modified by assignment__
- __For example, the empty 'debt' column can be assigned a scalar value:__

In [64]:
b_frame['debt'] = 10

b_frame

   year   state  pop  debt
a  2000    Ohio  1.5    10
b  2001    Ohio  1.7    10
c  2002    Ohio  3.6    10
d  2001  Nevada  2.4    10
e  2002  Nevada  2.9    10
f  2003  Nevada  3.2    10


In [70]:
# Another example
# when you assign a list or array, its length must match the length of the DataFrame
import numpy as np

b_frame['debt'] = np.arange(6)
b_frame

   year   state  pop  debt
a  2000    Ohio  1.5     0
b  2001    Ohio  1.7     1
c  2002    Ohio  3.6     2
d  2001  Nevada  2.4     3
e  2002  Nevada  2.9     4
f  2003  Nevada  3.2     5
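
- __To see the length rule in action, the sketch below assigns an array that is too short and catches the resulting error. The frame is a cut-down stand-in for 'b_frame', and 'error_message' is an illustrative name:__

```python
import numpy as np
import pandas as pd

b_frame = pd.DataFrame({'year': [2000, 2001, 2002, 2001, 2002, 2003]},
                       index=['a', 'b', 'c', 'd', 'e', 'f'])

error_message = None
try:
    # Four values for six rows: pandas refuses the assignment.
    b_frame['debt'] = np.arange(4)
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```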


In [73]:
# Assign a Series to a column: its labels are aligned to the DataFrame's index, and rows without a match become NaN
val = pd.Series([1.2,3,4.5], index = ['a','c','f'])

b_frame['debt'] = val
b_frame

   year   state  pop  debt
a  2000    Ohio  1.5   1.2
b  2001    Ohio  1.7   NaN
c  2002    Ohio  3.6   3.0
d  2001  Nevada  2.4   NaN
e  2002  Nevada  2.9   NaN
f  2003  Nevada  3.2   4.5
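
- __The alignment works in the other direction too: labels in the Series that do not exist in the DataFrame are ignored. A minimal sketch with a made-up three-row frame; 'frame' and the label 'z' are illustrative:__

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['a', 'b', 'c'])

# 'z' is not a row label of frame, so its value is dropped on assignment.
val = pd.Series([1.2, 9.9], index=['a', 'z'])
frame['debt'] = val

print(frame)
```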


In [74]:
# assigning to a column that doesn't exist creates a new column

b_frame['eastern'] = 'checkvalue'
b_frame

   year   state  pop  debt     eastern
a  2000    Ohio  1.5   1.2  checkvalue
b  2001    Ohio  1.7   NaN  checkvalue
c  2002    Ohio  3.6   3.0  checkvalue
d  2001  Nevada  2.4   NaN  checkvalue
e  2002  Nevada  2.9   NaN  checkvalue
f  2003  Nevada  3.2   4.5  checkvalue


- __To delete a column, we use the 'del' keyword__

In [76]:
del b_frame['eastern']
b_frame

   year   state  pop  debt
a  2000    Ohio  1.5   1.2
b  2001    Ohio  1.7   NaN
c  2002    Ohio  3.6   3.0
d  2001  Nevada  2.4   NaN
e  2002  Nevada  2.9   NaN
f  2003  Nevada  3.2   4.5
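
- __'del' modifies the DataFrame in place. When the original should stay intact, 'drop' returns a new frame instead. A small sketch with a made-up frame; 'frame' and 'trimmed' are illustrative names:__

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001],
                      'eastern': ['checkvalue', 'checkvalue']})

# drop returns a copy without the listed columns; frame itself is untouched.
trimmed = frame.drop(columns=['eastern'])

print(list(trimmed.columns))
print(list(frame.columns))
```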


- __A DataFrame can also be created from a nested dict of dicts: the outer keys become the columns and the inner keys become the row index__

In [77]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}
      }

c_frame = pd.DataFrame(pop)
c_frame

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5
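
- __The row labels above follow the order in which the inner keys were encountered. If chronological order is wanted, 'sort_index' is one way to get it. A minimal sketch; 'sorted_frame' is an illustrative name:__

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
c_frame = pd.DataFrame(pop)

# sort_index returns a new frame with the row labels in ascending order.
sorted_frame = c_frame.sort_index()

print(sorted_frame)
```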


- __We can also transpose the DataFrame (swap rows and columns)__

In [82]:
c_frame.T

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5


- __Let's look at the values and the keys (column labels) of the DataFrame__


In [83]:

c_frame.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [84]:
# keys is a method, so it must be called; it returns the column labels
c_frame.keys()

Index(['Nevada', 'Ohio'], dtype='object')
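
- __The row and column labels are also available as the 'index' and 'columns' attributes. A small sketch on the same data; 'column_labels' and 'row_labels' are illustrative names:__

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
c_frame = pd.DataFrame(pop)

column_labels = list(c_frame.columns)  # same labels that keys() returns
row_labels = list(c_frame.index)

print(column_labels)
print(row_labels)
```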

### Index Objects


- __pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names)__

In [85]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
obj

a    0
b    1
c    2
dtype: int64

In [86]:
obj.index

Index(['a', 'b', 'c'], dtype='object')

- __Index objects are immutable, which means their elements cannot be modified in place__
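
- __A minimal sketch of what immutability means in practice: assigning to one element of an Index raises a TypeError, and the labels stay as they were. 'error_raised' is an illustrative name:__

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])

error_raised = False
try:
    obj.index[1] = 'd'  # item assignment on an Index is not allowed
except TypeError:
    error_raised = True

print(error_raised)
print(list(obj.index))
```

- __Replacing the whole index by assignment (obj.index = [...]) is still allowed, because that swaps in a new Index object instead of modifying the existing one__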