<img style="float: left;" src="pic2.png">

### Sridhar Palle, Ph.D, spalle@emory.edu (Applied ML & DS with Python Program)

<img style = 'float:left;' src = 'pand.png'> 

* Overcomes the limitations of NumPy arrays in dealing with labelled data
* Built on top of NumPy
* Introduces the Series and DataFrame data structures, which cover most data mining tasks

## 1. Pandas Series
* One-dimensional array of indexed data.
* Built on top of numpy 1D array, and also has features similar to dictionaries.

In [None]:
# It can be thought of like a labelled 1-D numpy array or a specialized dictionary

In [None]:
# Pandas series differ from numpy arrays in that we can define labels for the series index

In [1]:
import numpy as np
import pandas as pd
pd.__version__

'0.24.2'

### 1.1 Creating pandas series objects
* Can be created from a list, numpy array, or tuple, or dictionary

**Creating series from a list**

In [2]:
my_series = pd.Series([0.5,0.8,0.9,1.3,0.4]) # similar to numpy array except now we have explicit index.
my_series

0    0.5
1    0.8
2    0.9
3    1.3
4    0.4
dtype: float64

In [3]:
my_series.values # returns the values of the series as a numpy array

array([0.5, 0.8, 0.9, 1.3, 0.4])

In [4]:
list(my_series.index) # .index returns an Index object; list() turns it into a plain list. my_series.keys() returns the same Index

[0, 1, 2, 3, 4]

In [5]:
my_series = pd.Series([1,2,3,4], index = ["a","b","c","d"]) # index can be changed.
my_series

a    1
b    2
c    3
d    4
dtype: int64

In [6]:
my_series.shape # It can be seen that pandas series is one-dimensional array.

(4,)

**creating series from a numpy array**

In [7]:
my_arr = np.array([8,7,6,5]) # creating Series from a numpy array
pd.Series(my_arr)

0    8
1    7
2    6
3    5
dtype: int64

**creating series from a tuple**

In [8]:
tp = (11,12,9,1)
pd.Series(tp) # Creating series from a tuple.

0    11
1    12
2     9
3     1
dtype: int64

**creating series from a dictionary**

In [9]:
my_dict = {'k1':1, 'k2':3, 'k4':5} # dictionary can be easily converted to series. keys become index, values are values.
pd.Series(my_dict)

k1    1
k2    3
k4    5
dtype: int64

In [10]:
my_series

a    1
b    2
c    3
d    4
dtype: int64

In [11]:
my_series2 = pd.Series([5,6,7,8], index = ["a","b","c","d"])
my_series2

a    5
b    6
c    7
d    8
dtype: int64

In [12]:
my_series+my_series2 # we can do element wise operations just like numpy arrays

a     6
b     8
c    10
d    12
dtype: int64
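
The element-wise addition above works cleanly because both series share the same labels. A minimal sketch of what happens when the labels only partly overlap (the names `s_left` and `s_right` and their values are made up for illustration):

```python
import pandas as pd

# Two hypothetical Series whose labels only partly overlap
s_left = pd.Series([1, 2, 3], index=["a", "b", "c"])
s_right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Values are matched on labels, not on position:
# 'b' and 'c' are summed, 'a' and 'd' have no partner and become NaN
total = s_left + s_right
print(total)
```

This label matching is the main difference from adding two plain numpy arrays, which are always combined position by position.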

In [13]:
d3 = {"k1": 2, "k2":[1,2,3], "k3":("a","b"),"k4":{"sk1":5}}

In [14]:
d3

{'k1': 2, 'k2': [1, 2, 3], 'k3': ('a', 'b'), 'k4': {'sk1': 5}}

In [15]:
pd.Series(d3)

k1             2
k2     [1, 2, 3]
k3        (a, b)
k4    {'sk1': 5}
dtype: object

### 1.2 Indexing & Modifying Series Objects

#### 1.2.1 Indexing
* Use square brackets [] for indexing

In [16]:
my_series = pd.Series([1.5,0.2,1.3,0.4], index=['a', 'b', 'c', 'd'])
my_series

a    1.5
b    0.2
c    1.3
d    0.4
dtype: float64

In [17]:
my_series['a'] # explicit indexing by name of index (similar to indexing of a dictionary)

1.5

In [18]:
my_series[0:3] # implicit indexing from 0th row to 2nd row.

a    1.5
b    0.2
c    1.3
dtype: float64

In [19]:
my_series[3::-1] # getting the series elements in a reverse order

d    0.4
c    1.3
b    0.2
a    1.5
dtype: float64

**potential for confusion with this indexing**

In [20]:
my_series2 = pd.Series(['a', 'b', 'c', 'd', 'e'], index = [1,2,3,4,5])
my_series2

1    a
2    b
3    c
4    d
5    e
dtype: object

In [21]:
my_series2[1] # a single integer is treated as an index label here (label 1), not as position 1

'a'

In [22]:
my_series2[1:4] # but an integer slice is treated as positions 1 to 3, not as labels 1 to 4

2    b
3    c
4    d
dtype: object

**best to use special indexing attributes (mainly for pandas series and dataframes)**
* .loc - for explicit indexing with index labels
* .iloc - for implicit indexing with integer row or column positions
* .ix - for mixed indexing (deprecated since pandas 0.20 and removed in pandas 1.0; use .loc or .iloc instead)

In [23]:
my_series2

1    a
2    b
3    c
4    d
5    e
dtype: object

In [24]:
my_series2.loc[3] # .loc is explicit indexing: provide the actual index label

'c'

In [25]:
my_series2.iloc[3] # .iloc is implicit indexing: provide the integer position

'd'

In [26]:
my_series2.iloc[1:]

2    b
3    c
4    d
5    e
dtype: object

In [27]:
my_series2.loc[1:]

1    a
2    b
3    c
4    d
5    e
dtype: object
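
One more difference between the two is worth spelling out: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position, like ordinary Python slicing. A small sketch, using a series `s` built the same way as `my_series2`:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d", "e"], index=[1, 2, 3, 4, 5])

by_label = s.loc[2:4]      # labels 2, 3 and 4 (end label is included)
by_position = s.iloc[2:4]  # positions 2 and 3 (end position is excluded)

print(list(by_label))      # ['b', 'c', 'd']
print(list(by_position))   # ['c', 'd']
```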

#### 1.2.2 Modifying Series

In [28]:
my_series2 = pd.Series(['a', 'b', 'c', 'd', 'e'], index = [1,2,3,4,5])
my_series2

1    a
2    b
3    c
4    d
5    e
dtype: object

In [29]:
my_series2

1    a
2    b
3    c
4    d
5    e
dtype: object

In [30]:
my_series2.loc[5] = 'y'
my_series2

1    a
2    b
3    c
4    d
5    y
dtype: object

In [31]:
my_series2.append(pd.Series(['f'])) # append returns a new series with the extra elements added at the end
# it does not happen in place. Note: Series.append was removed in pandas 2.0; pd.concat does the same job there

1    a
2    b
3    c
4    d
5    y
0    f
dtype: object

In [32]:
my_series2 

1    a
2    b
3    c
4    d
5    y
dtype: object
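
On newer pandas versions (2.0 and later), where `Series.append` no longer exists, the same result can be obtained with `pd.concat`. A minimal sketch; the names `s`, `extra` and `combined` are only for illustration:

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], index=[1, 2, 3])
extra = pd.Series(["f"], index=[4])  # hypothetical new element with label 4

# pd.concat returns a new Series; neither input is modified
combined = pd.concat([s, extra])

print(combined)
print(len(s))  # the original still has 3 elements
```

As with append, the result has to be assigned to a name if we want to keep it.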

In [33]:
my_series2.loc[6] = 'z'
my_series2

1    a
2    b
3    c
4    d
5    y
6    z
dtype: object

In [34]:
my_series2.drop(5) # drop index with name or label 5. not happening in place

1    a
2    b
3    c
4    d
6    z
dtype: object

In [35]:
my_series2

1    a
2    b
3    c
4    d
5    y
6    z
dtype: object

## 2. Pandas DataFrames
*  2D array with labelled data.
* Sequence of aligned Series objects.
* Generalization of numpy array with labelled data

In [None]:
# Basically each row or column is a series

In [None]:
# Data frame can be thought of as a 2D numpy array with labelled rows and columns or 
# .....also like a dictionary where column name is the key, and column data are values..

### 2.1 Creating Data Frames

In [36]:
pd.DataFrame

pandas.core.frame.DataFrame



#### Init signature: pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

In [None]:
# DataFrames can be created by reading csv files, from a 2D numpy array, from a dictionary of series objects,
# or from a dictionary of list objects.

**Creating data frame from a 2D numpy array**

In [37]:
df = pd.DataFrame(np.random.rand(4,5))
df

Unnamed: 0,0,1,2,3,4
0,0.317397,0.685026,0.601929,0.381939,0.721429
1,0.653274,0.060569,0.434749,0.453427,0.570158
2,0.040836,0.975814,0.960774,0.288605,0.816913
3,0.93291,0.550642,0.530056,0.663251,0.982677


In [38]:
df = pd.DataFrame(np.random.rand(4,5), index = 'r1 r2 r3 r4'.split(), columns = 'c1 c2 c3 c4 c5'.split())
df

Unnamed: 0,c1,c2,c3,c4,c5
r1,0.980654,0.58029,0.079541,0.59453,0.843905
r2,0.774207,0.228645,0.544318,0.976676,0.095036
r3,0.206792,0.226354,0.94472,0.780921,0.661568
r4,0.207371,0.035855,0.700011,0.861092,0.330183


In [39]:
list(df.index) #  We get the index (basically rownames) 

['r1', 'r2', 'r3', 'r4']

In [40]:
list(df.columns) # to obtain column names, df.keys() will get you the same result.

['c1', 'c2', 'c3', 'c4', 'c5']

In [41]:
df.keys()

Index(['c1', 'c2', 'c3', 'c4', 'c5'], dtype='object')

**Creating data frame from a dictionary of lists**

In [42]:
df = pd.DataFrame({"k1":[1,2,3], "k2":["a","b","c"]}) 
# no need to specify column names, because dictionary keys are same as columns for dataframes

In [43]:
df

Unnamed: 0,k1,k2
0,1,a
1,2,b
2,3,c


**from a list (a single row)**

In [44]:
df = pd.DataFrame([[1,2,3,4]]) # even a single row must be passed as a list of lists
df

Unnamed: 0,0,1,2,3
0,1,2,3,4


**from list of lists**

In [45]:
df = pd.DataFrame([[1,2,3,4],["a","b","c","d"]], index='r1,r2'.split(","), columns='c1,c2,c3,c4'.split(","))
df

Unnamed: 0,c1,c2,c3,c4
r1,1,2,3,4
r2,a,b,c,d


**Creating data frame from a dictionary of series or dictionary objects**

In [46]:
d1 = {'Josh': 6, 'kevin': 5.5, 'kumar': 5.8, 'shelly': 4.9}
d2 = {'Josh': 180 , 'kevin': 150, 'kumar': 140, 'shelly': 120}
s1 = pd.Series(d1)
s2 = pd.Series(d2)
print ('s1\n', s1, '\n')
print ('s2\n', s2)

s1
 Josh      6.0
kevin     5.5
kumar     5.8
shelly    4.9
dtype: float64 

s2
 Josh      180
kevin     150
kumar     140
shelly    120
dtype: int64


In [47]:
df = pd.DataFrame({'height':s1, 'weight':s2}) # created from a dictionary of series objects. 
df

Unnamed: 0,height,weight
Josh,6.0,180
kevin,5.5,150
kumar,5.8,140
shelly,4.9,120


In [48]:
df = pd.DataFrame({'height':d1, 'weight':d2}) # it can also be created from a dictionary of dictionaries
df

Unnamed: 0,height,weight
Josh,6.0,180
kevin,5.5,150
kumar,5.8,140
shelly,4.9,120
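
In the two examples above the keys of both inputs are identical. A short sketch of what "aligned Series" means when they are not (the names and numbers below are made up for illustration): pandas takes the union of the labels and fills the gaps with NaN.

```python
import pandas as pd

# Hypothetical data: 'Josh' has no weight, 'shelly' has no height
height = pd.Series({"Josh": 6.0, "kevin": 5.5, "kumar": 5.8})
weight = pd.Series({"kevin": 150, "kumar": 140, "shelly": 120})

people = pd.DataFrame({"height": height, "weight": weight})
print(people)  # 4 rows: one per label found in either Series
```

Note that the weight column becomes float64, because NaN cannot be stored in an int64 column.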


**Creating data frame by reading data from a csv file (pd.read_csv)**

In [49]:
pwd

'/Users/rajpardasani/Documents/GitHub/data_science/numpy_pandas_intro'

In [50]:
df = pd.read_csv('wine.csv') # reading from a file is the most common way we will create data frames

In [51]:
df.shape # to get the dimensions. 

(178, 15)

In [52]:
df.head(3) # show first few rows. 

Unnamed: 0.1,Unnamed: 0,Type,Alcohol,Malic,Ash,Alcalinity,Magnesium,Phenols,Flavanoids,Nonflavanoids,Proanthocyanins,Color,Hue,Dilution,Proline
0,1,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,2,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,3,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185


**df.info()** gives brief information on the data frame's columns and their types, similar to str(df) in R.

**df.describe()** gives brief statistics on all numeric columns, similar to summary(df) in R.

In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 15 columns):
Unnamed: 0         178 non-null int64
Type               178 non-null int64
Alcohol            178 non-null float64
Malic              178 non-null float64
Ash                178 non-null float64
Alcalinity         178 non-null float64
Magnesium          178 non-null int64
Phenols            178 non-null float64
Flavanoids         178 non-null float64
Nonflavanoids      178 non-null float64
Proanthocyanins    178 non-null float64
Color              178 non-null float64
Hue                178 non-null float64
Dilution           178 non-null float64
Proline            178 non-null int64
dtypes: float64(11), int64(4)
memory usage: 20.9 KB


In [54]:
df.describe() # brief statistics

Unnamed: 0.1,Unnamed: 0,Type,Alcohol,Malic,Ash,Alcalinity,Magnesium,Phenols,Flavanoids,Nonflavanoids,Proanthocyanins,Color,Hue,Dilution,Proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,89.5,1.938202,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258
std,51.528309,0.775035,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474
min,1.0,1.0,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0
25%,45.25,1.0,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5
50%,89.5,2.0,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5
75%,133.75,3.0,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0
max,178.0,3.0,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0


**Some names are written df.name() and others just df.name. Methods are called with parentheses, df.method(); attributes are accessed without them, df.attribute.**

**A method performs some action or operation. An attribute is just a property of the object; it simply reports some information.**
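
A quick way to see the difference, sketched on a small made-up data frame: `shape` is an attribute and `head` is a method.

```python
import pandas as pd

df = pd.DataFrame({"k1": [1, 2, 3], "k2": ["a", "b", "c"]})

shape = df.shape        # attribute: stored information, no parentheses
first_two = df.head(2)  # method: performs an operation, called with ()

print(callable(df.head))   # True, a method can be called
print(callable(df.shape))  # False, the attribute is just a tuple
```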


### 2.2 Indexing/Slicing DataFrames
* df.loc[] for explicit indexing on rows and columns (with names)
* df.iloc[] for implicit indexing on rows and columns (with index position)
* df[] for explicit indexing with column names


In [55]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,Type,Alcohol,Malic,Ash,Alcalinity,Magnesium,Phenols,Flavanoids,Nonflavanoids,Proanthocyanins,Color,Hue,Dilution,Proline
0,1,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,2,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,3,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185


In [56]:
df['Alcohol'].head() # to get a particular column

0    14.23
1    13.20
2    13.16
3    14.37
4    13.24
Name: Alcohol, dtype: float64

In [57]:
df[["Alcohol","Phenols"]].head(3) # pass a list of columns, for more than one column

Unnamed: 0,Alcohol,Phenols
0,14.23,2.8
1,13.2,2.65
2,13.16,2.8


In [58]:
rnames = ['r' + str(i) for i in range(0,178)] # creating a list of row names
df.index = rnames # assign these row names to df index
df.head()

Unnamed: 0.1,Unnamed: 0,Type,Alcohol,Malic,Ash,Alcalinity,Magnesium,Phenols,Flavanoids,Nonflavanoids,Proanthocyanins,Color,Hue,Dilution,Proline
r0,1,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
r1,2,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
r2,3,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
r3,4,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
r4,5,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [59]:
df.loc['r3'] # use .loc for explicit indexing on rows. For explicit indexing on columns .loc is optional, but it is better to use it

Unnamed: 0            4.00
Type                  1.00
Alcohol              14.37
Malic                 1.95
Ash                   2.50
Alcalinity           16.80
Magnesium           113.00
Phenols               3.85
Flavanoids            3.49
Nonflavanoids         0.24
Proanthocyanins       2.18
Color                 7.80
Hue                   0.86
Dilution              3.45
Proline            1480.00
Name: r3, dtype: float64

In [60]:
df.loc['r3',:] # row 'r3', all columns. Also better to use ',' between rows and columns

Unnamed: 0            4.00
Type                  1.00
Alcohol              14.37
Malic                 1.95
Ash                   2.50
Alcalinity           16.80
Magnesium           113.00
Phenols               3.85
Flavanoids            3.49
Nonflavanoids         0.24
Proanthocyanins       2.18
Color                 7.80
Hue                   0.86
Dilution              3.45
Proline            1480.00
Name: r3, dtype: float64

In [61]:
df.head()

Unnamed: 0.1,Unnamed: 0,Type,Alcohol,Malic,Ash,Alcalinity,Magnesium,Phenols,Flavanoids,Nonflavanoids,Proanthocyanins,Color,Hue,Dilution,Proline
r0,1,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
r1,2,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
r2,3,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
r3,4,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
r4,5,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [62]:
df.iloc[2:4,1:4] # rows at positions 2 and 3, columns at positions 1, 2 and 3

Unnamed: 0,Type,Alcohol,Malic
r2,1,13.16,2.36
r3,1,14.37,1.95


In [63]:
df.loc[['r3','r4','r5'],['Alcohol', 'Malic', 'Ash']].head() 
# when selecting multiple columns or rows, pass as a list. Also use .loc for indexing with names

Unnamed: 0,Alcohol,Malic,Ash
r3,14.37,1.95,2.5
r4,13.24,2.59,2.87
r5,14.2,1.76,2.45
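
To compare the two indexers side by side, here is a minimal sketch on a small made-up data frame (not the wine data), where the same block is selected once by name and once by position:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["r0", "r1", "r2"],
    columns=["c0", "c1", "c2"],
)

by_name = df.loc[["r0", "r1"], ["c1", "c2"]]  # explicit: row and column names
by_position = df.iloc[0:2, 1:3]               # implicit: integer positions

print(by_name.equals(by_position))  # True, both select the same block
print(df.loc["r2", "c0"])           # a single cell: 7
```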


### 2.3 Few important dataframe methods

In [64]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,Type,Alcohol,Malic,Ash,Alcalinity,Magnesium,Phenols,Flavanoids,Nonflavanoids,Proanthocyanins,Color,Hue,Dilution,Proline
r0,1,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
r1,2,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
r2,3,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185


**df.max(), df.min(), df.sum(), df.mean(), df.idxmax(), df.idxmin() and many more**

**df.max()**

In [65]:
df['Alcohol'].max() # to get the maximum value of a particular column

14.83

In [66]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,Type,Alcohol,Malic,Ash,Alcalinity,Magnesium,Phenols,Flavanoids,Nonflavanoids,Proanthocyanins,Color,Hue,Dilution,Proline
r0,1,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
r1,2,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
r2,3,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185


In [67]:
df.max(axis=0) # df.max(axis=0 or 1) gives the maximum of each column (axis=0) or of each row (axis=1). By default axis = 0
# axis=0 collapses the rows (one result per column), axis=1 collapses the columns (one result per row)

Unnamed: 0          178.00
Type                  3.00
Alcohol              14.83
Malic                 5.80
Ash                   3.23
Alcalinity           30.00
Magnesium           162.00
Phenols               3.88
Flavanoids            5.08
Nonflavanoids         0.66
Proanthocyanins       3.58
Color                13.00
Hue                   1.71
Dilution              4.00
Proline            1680.00
dtype: float64
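
The axis argument is easier to see on a tiny example. A sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame(
    {"c0": [1, 5], "c1": [7, 2], "c2": [3, 9]},
    index=["r0", "r1"],
)

col_max = df.max(axis=0)  # one value per column: c0=5, c1=7, c2=9
row_max = df.max(axis=1)  # one value per row:    r0=7, r1=9

print(col_max)
print(row_max)
```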

**df.idxmax()**

In [68]:
df['Alcohol'].idxmax()# this gives the index name where value is max, similar to which.max() in R

'r8'

In [69]:
df.idxmax() # if we do it on whole data frame, it will give the index value for each column, where max value occurs

Unnamed: 0         r177
Type               r130
Alcohol              r8
Malic              r123
Ash                r121
Alcalinity          r73
Magnesium           r95
Phenols             r52
Flavanoids         r121
Nonflavanoids      r105
Proanthocyanins    r110
Color              r158
Hue                r115
Dilution            r22
Proline             r18
dtype: object

**df.mean()**

In [70]:
df.mean(axis=0) # to get the mean of each column

Unnamed: 0          89.500000
Type                 1.938202
Alcohol             13.000618
Malic                2.336348
Ash                  2.366517
Alcalinity          19.494944
Magnesium           99.741573
Phenols              2.295112
Flavanoids           2.029270
Nonflavanoids        0.361854
Proanthocyanins      1.590899
Color                5.058090
Hue                  0.957449
Dilution             2.611685
Proline            746.893258
dtype: float64

**df.apply()**

df.apply(func, axis) can be used to apply any function across all columns or rows, similar to apply() in R.

functions can also be user defined functions.

In [71]:
df.apply(np.mean, axis=0)  # same result as df.mean(axis=0) above
# instead of np.mean, we could have applied any function here, including a user-defined one

Unnamed: 0          89.500000
Type                 1.938202
Alcohol             13.000618
Malic                2.336348
Ash                  2.366517
Alcalinity          19.494944
Magnesium           99.741573
Phenols              2.295112
Flavanoids           2.029270
Nonflavanoids        0.361854
Proanthocyanins      1.590899
Color                5.058090
Hue                  0.957449
Dilution             2.611685
Proline            746.893258
dtype: float64
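
As an illustration of the user-defined case, here is a minimal sketch. The function `value_range` and the small data frame are hypothetical, chosen only to keep the arithmetic easy to check:

```python
import pandas as pd

def value_range(values):
    """Difference between the largest and smallest value (hypothetical helper)."""
    return values.max() - values.min()

df = pd.DataFrame({"c0": [1, 4, 10], "c1": [2.0, 2.5, 3.0]})

col_ranges = df.apply(value_range, axis=0)  # function receives each column
row_ranges = df.apply(value_range, axis=1)  # function receives each row

print(col_ranges)  # c0 -> 9.0, c1 -> 1.0
print(row_ranges)  # 1.0, 1.5, 7.0
```

In both cases the function receives a Series, so any Series method can be used inside it.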

In [72]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,Type,Alcohol,Malic,Ash,Alcalinity,Magnesium,Phenols,Flavanoids,Nonflavanoids,Proanthocyanins,Color,Hue,Dilution,Proline
r0,1,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
r1,2,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050


**df[ ].unique()**  to get all the unique elements in a column

In [73]:
df['Type'].unique() # lists all the unique elements in wine 'Type' column

array([1, 2, 3])

**df[ ].nunique()** to obtain total # of unique elements

In [74]:
df['Type'].nunique() # total number of unique values

3

**df[ ].value_counts()** unique values and their frequencies

In [75]:
df['Type'].value_counts() # frequency of each unique value.  In R we simply use table(df['c1']) to get unique values

2    71
1    59
3    48
Name: Type, dtype: int64
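
If proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. A small sketch on a made-up series (not the wine data):

```python
import pandas as pd

types = pd.Series([1, 2, 2, 3, 2, 1])  # hypothetical 'Type' values

counts = types.value_counts()                # 2 -> 3, 1 -> 2, 3 -> 1
shares = types.value_counts(normalize=True)  # same order, as fractions of the total

print(counts)
print(shares)
```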

**df.sort_values('column', axis)** To sort a dataframe based on a particular column

In [76]:
df.sort_values('Alcohol') # In R we can use the order() which gives sorted indices, which can then be used in df[order(),]

Unnamed: 0.1,Unnamed: 0,Type,Alcohol,Malic,Ash,Alcalinity,Magnesium,Phenols,Flavanoids,Nonflavanoids,Proanthocyanins,Color,Hue,Dilution,Proline
r115,116,2,11.03,1.51,2.20,21.5,85,2.46,2.17,0.52,2.01,1.90,1.71,2.87,407
r113,114,2,11.41,0.74,2.50,21.0,88,2.48,2.01,0.42,1.44,3.08,1.10,2.31,434
r120,121,2,11.45,2.40,2.42,20.0,96,2.90,2.79,0.32,1.83,3.25,0.80,3.39,625
r110,111,2,11.46,3.74,1.82,19.5,107,3.18,2.58,0.24,3.58,2.90,0.75,2.81,562
r121,122,2,11.56,2.05,3.23,28.5,119,3.18,5.08,0.47,1.87,6.00,0.93,3.69,465
r109,110,2,11.61,1.35,2.70,20.0,94,2.74,2.92,0.29,2.49,2.65,0.96,3.26,680
r94,95,2,11.62,1.99,2.28,18.0,98,3.02,2.26,0.17,1.35,3.25,1.16,2.96,345
r88,89,2,11.64,2.06,2.46,21.6,84,1.95,1.69,0.48,1.35,2.80,1.00,2.75,680
r87,88,2,11.65,1.67,2.62,26.0,88,1.92,1.61,0.40,1.34,2.60,1.36,3.21,562
r75,76,2,11.66,1.88,1.92,16.0,97,1.61,1.57,0.34,1.15,3.80,1.23,2.14,428


**df.groupby()** Another powerful method to group the data frame based on a column, and then apply some function

In [77]:
d = {"item":["chair", "desk", "rug", "table", "chair", "couch", "couch", "chair", "rug", "desk"], 
     "agent":["sally", "bob", "sally", "amy", "bob", "amy", "sally", "bob", "amy", "sally"],
      "sale_price": [100, 110, 200, 100, 150, 800, 1000, 100, 85, 110],
        "quantity": [10, 5, 15, 8, 20, 5, 4, 11, 9, 5]}

In [78]:
df = pd.DataFrame(d)
df['revenue'] = df['quantity']*df['sale_price']
df

Unnamed: 0,item,agent,sale_price,quantity,revenue
0,chair,sally,100,10,1000
1,desk,bob,110,5,550
2,rug,sally,200,15,3000
3,table,amy,100,8,800
4,chair,bob,150,20,3000
5,couch,amy,800,5,4000
6,couch,sally,1000,4,4000
7,chair,bob,100,11,1100
8,rug,amy,85,9,765
9,desk,sally,110,5,550


In [79]:
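df.groupby("item").mean() # only sale_price, quantity, revenue are returned because they are numeric and agent is not.
# (pandas 2.0 and later raise a TypeError here instead; pass numeric_only=True to get the same result)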

Unnamed: 0_level_0,sale_price,quantity,revenue
item,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
chair,116.666667,13.666667,1700.0
couch,900.0,4.5,4000.0
desk,110.0,5.0,550.0
rug,142.5,12.0,1882.5
table,100.0,8.0,800.0


In [80]:
df.groupby("agent")[['quantity', 'revenue']].sum() # we can also select the columns to aggregate by passing a list of column names (here we take the sum).

Unnamed: 0_level_0,quantity,revenue
agent,Unnamed: 1_level_1,Unnamed: 2_level_1
amy,22,5565
bob,36,4650
sally,34,8550


### 2.3 Boolean Masking Data Frames

In [81]:
boston = pd.read_csv('boston_housing.csv')

In [82]:
boston.shape

(506, 15)

In [83]:
boston.head(3)

Unnamed: 0.1,Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7


In [84]:
boston.describe()

Unnamed: 0.1,Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,252.5,3.593761,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063,22.532806
std,146.213884,8.596783,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062,9.197104
min,0.0,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,126.25,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95,17.025
50%,252.5,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36,21.2
75%,378.75,3.647422,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955,25.0
max,505.0,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0


In [85]:
boston.loc[(boston['RM'] > 8), 'MEDV' ] # MEDV for homes where the number of rooms is high (RM > 8)

97     38.7
163    50.0
204    50.0
224    44.8
225    50.0
226    37.6
232    41.7
233    48.3
253    42.8
257    50.0
262    48.8
267    50.0
364    21.9
Name: MEDV, dtype: float64

In [86]:
boston.loc[(boston['RM'] > 8), 'MEDV' ].mean()
# always remember to use .loc for rows when using names or booleans

44.2

In [87]:
boston.loc[boston['NOX'] > 0.8,'MEDV'].mean() # Finding the mean of MEDV where Nox is high.

16.425

In [88]:
boston.loc[boston['NOX'] < 0.4,'MEDV'].mean() # Finding the mean of MEDV where Nox is low.

24.54285714285714

**.ix allowed a hybrid approach to indexing (hybrid between loc and iloc), but it is deprecated (and removed in pandas 1.0). Use .loc or .iloc instead**

In [89]:
boston.loc[0:3, ['NOX','MEDV']] # label-based slice, so row label 3 is included


Unnamed: 0,NOX,MEDV
0,0.538,24.0
1,0.469,21.6
2,0.469,34.7
3,0.458,33.4


In [90]:
# when combining multiple conditions for dataframes or arrays, do not use 'and'/'or'; use & and | instead
boston.loc[(boston["RM"]>8) & (boston["LSTAT"] < 3),:] # 'and'/'or' evaluate the whole object at once, not element by element
# wrap each condition in parentheses, because & and | bind more tightly than comparisons. In R we use & for this anyway.

Unnamed: 0.1,Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
204,204,0.02009,95.0,2.68,0.0,0.4161,8.034,31.9,5.118,4.0,224.0,14.7,390.55,2.88,50.0
232,232,0.57529,0.0,6.2,0.0,0.507,8.337,73.3,3.8384,8.0,307.0,17.4,385.91,2.47,41.7


### 2.4 Modifying Data Frames

In [91]:
import numpy as np
import pandas as pd
df = pd.DataFrame([[0.1, 0.2, 0.3, 0.4], [1, 2, 3, 4], [10, 20, 30, 40]], index = 'r0 r1 r2'.split(), columns = 'c0 c1 c2 c3'.split())
df

Unnamed: 0,c0,c1,c2,c3
r0,0.1,0.2,0.3,0.4
r1,1.0,2.0,3.0,4.0
r2,10.0,20.0,30.0,40.0


In [92]:
df['c4'] = [0.5, 5, 50] # to add a new column
df

Unnamed: 0,c0,c1,c2,c3,c4
r0,0.1,0.2,0.3,0.4,0.5
r1,1.0,2.0,3.0,4.0,5.0
r2,10.0,20.0,30.0,40.0,50.0


**.assign()** to include multiple new columns

In [93]:
df.assign(c5 = [0.6, 6, 60], c6 = [0.7, 7, 70]) # Not happening in place

Unnamed: 0,c0,c1,c2,c3,c4,c5,c6
r0,0.1,0.2,0.3,0.4,0.5,0.6,0.7
r1,1.0,2.0,3.0,4.0,5.0,6.0,7.0
r2,10.0,20.0,30.0,40.0,50.0,60.0,70.0


In [94]:
df

Unnamed: 0,c0,c1,c2,c3,c4
r0,0.1,0.2,0.3,0.4,0.5
r1,1.0,2.0,3.0,4.0,5.0
r2,10.0,20.0,30.0,40.0,50.0


**.append()** to add rows (column names should match if we do not want any missing values)

In [95]:
df

Unnamed: 0,c0,c1,c2,c3,c4
r0,0.1,0.2,0.3,0.4,0.5
r1,1.0,2.0,3.0,4.0,5.0
r2,10.0,20.0,30.0,40.0,50.0


In [96]:
df2 = pd.DataFrame([[7, 8, 9, 10, 11], [11, 21, 31, 41, 51]], index = 'r3 r4'.split(), columns = 'c0 c1 c2 c3 c4'.split())
df2

Unnamed: 0,c0,c1,c2,c3,c4
r3,7,8,9,10,11
r4,11,21,31,41,51


In [97]:
df.append(df2) # appending doesn't happen in place (DataFrame.append was removed in pandas 2.0; use pd.concat([df, df2]) there)

Unnamed: 0,c0,c1,c2,c3,c4
r0,0.1,0.2,0.3,0.4,0.5
r1,1.0,2.0,3.0,4.0,5.0
r2,10.0,20.0,30.0,40.0,50.0
r3,7.0,8.0,9.0,10.0,11.0
r4,11.0,21.0,31.0,41.0,51.0


In [98]:
df

Unnamed: 0,c0,c1,c2,c3,c4
r0,0.1,0.2,0.3,0.4,0.5
r1,1.0,2.0,3.0,4.0,5.0
r2,10.0,20.0,30.0,40.0,50.0


**.pop() can be used to remove a column**

In [99]:
df.pop("c4") # returns the popped column and mutates the original df

r0     0.5
r1     5.0
r2    50.0
Name: c4, dtype: float64

In [100]:
df

Unnamed: 0,c0,c1,c2,c3
r0,0.1,0.2,0.3,0.4
r1,1.0,2.0,3.0,4.0
r2,10.0,20.0,30.0,40.0


In [101]:
df["c4"] = [0.5, 5 ,50] # adding the column back
df

Unnamed: 0,c0,c1,c2,c3,c4
r0,0.1,0.2,0.3,0.4,0.5
r1,1.0,2.0,3.0,4.0,5.0
r2,10.0,20.0,30.0,40.0,50.0


**.drop() can be used to remove a column or row**

In [102]:
df.drop("c4",axis=1) # removes a column (axis=1 is for columns). Rows can be deleted the same way with axis=0. The deletion does not happen in place.
# pass inplace=True if we want the original df to be modified.

Unnamed: 0,c0,c1,c2,c3
r0,0.1,0.2,0.3,0.4
r1,1.0,2.0,3.0,4.0
r2,10.0,20.0,30.0,40.0


In [103]:
df

Unnamed: 0,c0,c1,c2,c3,c4
r0,0.1,0.2,0.3,0.4,0.5
r1,1.0,2.0,3.0,4.0,5.0
r2,10.0,20.0,30.0,40.0,50.0


In [104]:
df.drop(['c2','c4'],axis=1)

Unnamed: 0,c0,c1,c3
r0,0.1,0.2,0.4
r1,1.0,2.0,4.0
r2,10.0,20.0,40.0


In [105]:
df

Unnamed: 0,c0,c1,c2,c3,c4
r0,0.1,0.2,0.3,0.4,0.5
r1,1.0,2.0,3.0,4.0,5.0
r2,10.0,20.0,30.0,40.0,50.0


In [106]:
df.drop('r2',axis=0) # to remove a row

Unnamed: 0,c0,c1,c2,c3,c4
r0,0.1,0.2,0.3,0.4,0.5
r1,1.0,2.0,3.0,4.0,5.0


In [107]:
df

Unnamed: 0,c0,c1,c2,c3,c4
r0,0.1,0.2,0.3,0.4,0.5
r1,1.0,2.0,3.0,4.0,5.0
r2,10.0,20.0,30.0,40.0,50.0


***Extras (Optional)***

In [108]:
df.drop(df.index[0:1], axis=0) # to drop rows by row index position. for columns we could similarly use df.columns with axis=1

Unnamed: 0,c0,c1,c2,c3,c4
r1,1.0,2.0,3.0,4.0,5.0
r2,10.0,20.0,30.0,40.0,50.0


In [109]:
del df['c3'] # del can also be used to delete columns; it happens in place. del also works for list and dictionary elements

In [110]:
df

Unnamed: 0,c0,c1,c2,c4
r0,0.1,0.2,0.3,0.5
r1,1.0,2.0,3.0,5.0
r2,10.0,20.0,30.0,50.0


### reset index

In [111]:
df = pd.DataFrame(np.random.randint(0,10,(2,3)), index = 'r0 r1'.split(), columns = 'c0 c1 c2'.split())
df

Unnamed: 0,c0,c1,c2
r0,1,4,8
r1,1,6,2


In [112]:
df.reset_index(drop=False) # moves the index into a column called 'index'; does not happen in place

Unnamed: 0,index,c0,c1,c2
0,r0,1,4,8
1,r1,1,6,2


In [113]:
df

Unnamed: 0,c0,c1,c2
r0,1,4,8
r1,1,6,2


In [114]:
df.set_index("c2", inplace=True) # to set a column as index. if inplace=False, original df unchanged

In [115]:
df

Unnamed: 0_level_0,c0,c1
c2,Unnamed: 1_level_1,Unnamed: 2_level_1
8,1,4
2,1,6


In [116]:
df.index= ['a', 'b'] # to give a new index

In [117]:
df

Unnamed: 0,c0,c1
a,1,4
b,1,6


### Concatenating, Merging dataframes

In [None]:
# for joining data frames, use concat and merge
# use concat if we just want to stack dataframes by rows or columns, as long as the index or column names line up

**pd.concat** (for concatenating rows or columns to an existing data frame)


In [118]:
df1 = pd.DataFrame(np.random.randn(3,4), index = 'r0 r1 r2'.split(), columns = 'c0 c1 c2 c3'.split())
df1

Unnamed: 0,c0,c1,c2,c3
r0,-0.63076,-1.424806,-0.318257,1.36502
r1,0.396631,-0.212319,1.63327,0.436105
r2,-0.487906,0.770749,1.418715,-0.766619


In [119]:
df2 = pd.DataFrame(np.random.randn(3,4), index = 'r0 r1 r2'.split(), columns = 'c4 c5 c6 c7'.split())
df2 

Unnamed: 0,c4,c5,c6,c7
r0,0.138295,-0.584951,1.331615,-0.622022
r1,0.458289,0.900358,-1.180492,-1.508254
r2,-2.394889,0.693192,0.573815,0.17452


In [120]:
pd.concat([df1,df2], axis=1) # axis = 1 for column, default is 0
# same row indexes, but different column names. We can use concat to join data frames together by column

Unnamed: 0,c0,c1,c2,c3,c4,c5,c6,c7
r0,-0.63076,-1.424806,-0.318257,1.36502,0.138295,-0.584951,1.331615,-0.622022
r1,0.396631,-0.212319,1.63327,0.436105,0.458289,0.900358,-1.180492,-1.508254
r2,-0.487906,0.770749,1.418715,-0.766619,-2.394889,0.693192,0.573815,0.17452


In [121]:
df1

Unnamed: 0,c0,c1,c2,c3
r0,-0.63076,-1.424806,-0.318257,1.36502
r1,0.396631,-0.212319,1.63327,0.436105
r2,-0.487906,0.770749,1.418715,-0.766619


In [122]:
df3 = pd.DataFrame(np.random.randn(3,4), index = 'r3 r4 r5'.split(), columns = 'c0 c1 c2 c3'.split())
df3

Unnamed: 0,c0,c1,c2,c3
r3,-1.787443,-0.403099,0.195089,0.286614
r4,-1.626698,0.596692,0.59858,-1.470206
r5,-1.235216,-1.267369,0.134166,0.283894


In [123]:
pd.concat([df1,df3], axis=0) # joining data frames by rows; column names must match here if we don't want any NaN.
# in this pandas version we could do this with append as well (df1.append(df3))

Unnamed: 0,c0,c1,c2,c3
r0,-0.63076,-1.424806,-0.318257,1.36502
r1,0.396631,-0.212319,1.63327,0.436105
r2,-0.487906,0.770749,1.418715,-0.766619
r3,-1.787443,-0.403099,0.195089,0.286614
r4,-1.626698,0.596692,0.59858,-1.470206
r5,-1.235216,-1.267369,0.134166,0.283894


In [124]:
df1

Unnamed: 0,c0,c1,c2,c3
r0,-0.63076,-1.424806,-0.318257,1.36502
r1,0.396631,-0.212319,1.63327,0.436105
r2,-0.487906,0.770749,1.418715,-0.766619


In [125]:
df2

Unnamed: 0,c4,c5,c6,c7
r0,0.138295,-0.584951,1.331615,-0.622022
r1,0.458289,0.900358,-1.180492,-1.508254
r2,-2.394889,0.693192,0.573815,0.17452


In [126]:
pd.concat([df1,df2], axis=0, sort=False) # we get NaN because, when joining by rows, the column names are different.

Unnamed: 0,c0,c1,c2,c3,c4,c5,c6,c7
r0,-0.63076,-1.424806,-0.318257,1.36502,,,,
r1,0.396631,-0.212319,1.63327,0.436105,,,,
r2,-0.487906,0.770749,1.418715,-0.766619,,,,
r0,,,,,0.138295,-0.584951,1.331615,-0.622022
r1,,,,,0.458289,0.900358,-1.180492,-1.508254
r2,,,,,-2.394889,0.693192,0.573815,0.17452


**merge**

In [None]:
# if we want to merge by some common column, then we use merge

In [127]:
df1 = pd.DataFrame([[1, 2, 3, 4], [2, 5, 6, 7], [3, 7, 5, 8]], columns = 'c0 c1 c2 c3'.split())
df1

Unnamed: 0,c0,c1,c2,c3
0,1,2,3,4
1,2,5,6,7
2,3,7,5,8


In [128]:
df2 = pd.DataFrame([[1, 21, 31, 41], [2, 51, 61, 71], [3, 71, 51, 81]], columns = 'c0 c5 c6 c7'.split())
df2

Unnamed: 0,c0,c5,c6,c7
0,1,21,31,41
1,2,51,61,71
2,3,71,51,81


In [129]:
df3 = pd.merge(df1,df2) # Pandas automatically recognizes that each data frame has a common 'c0' column.
# the key can also be given explicitly: pd.merge(df1, df2, on='c0')
df3

Unnamed: 0,c0,c1,c2,c3,c5,c6,c7
0,1,2,3,4,21,31,41
1,2,5,6,7,51,61,71
2,3,7,5,8,71,51,81


In [130]:
df4 = pd.merge(df1,df2,how='inner', left_on = 'c0', right_on='c0') 
df4
# left_on and right_on are used when the key columns hold the same values but have different names in the two frames.
# in this example it doesn't matter, because the key is named c0 in both frames.

Unnamed: 0,c0,c1,c2,c3,c5,c6,c7
0,1,2,3,4,21,31,41
1,2,5,6,7,51,61,71
2,3,7,5,8,71,51,81


In [None]:
# There are many more merge operations: left join, right join, outer join, etc.
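
The join types mentioned above differ only in which keys survive the merge. The sketch below is illustrative: the frames `customers` and `orders` and their column names are hypothetical, chosen so that the key columns have different names and only partly overlap.

```python
import pandas as pd

# Hypothetical frames whose key columns have different names and only partly overlap
customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"customer": [2, 3, 4], "amount": [50, 75, 20]})

# inner: keep only keys present in both frames
inner = pd.merge(customers, orders, how="inner", left_on="cust_id", right_on="customer")

# left: keep every row of the left frame; unmatched right-hand columns become NaN
left = pd.merge(customers, orders, how="left", left_on="cust_id", right_on="customer")

# outer: keep keys that appear in either frame
outer = pd.merge(customers, orders, how="outer", left_on="cust_id", right_on="customer")

print(len(inner), len(left), len(outer))  # prints: 2 3 4
```

`how="right"` mirrors the left join: it keeps every row of the right frame instead.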

## 2.5 Missing Values
* isnull(), isna()
* notnull(), notna()
* dropna()
* fillna()

In [131]:
import pandas as pd
import numpy as np

In [132]:
d = {"c0": [1,2,3,7,None,12], "c1": [4,12, 15, np.nan, 5, 8], "c2":[10, 11, np.nan, np.nan, 9, 12], "c3":[3,6,np.nan, 9, np.nan, 11]}
d

{'c0': [1, 2, 3, 7, None, 12],
 'c1': [4, 12, 15, nan, 5, 8],
 'c2': [10, 11, nan, nan, 9, 12],
 'c3': [3, 6, nan, 9, nan, 11]}

In [133]:
df = pd.DataFrame(d) 
df

Unnamed: 0,c0,c1,c2,c3
0,1.0,4.0,10.0,3.0
1,2.0,12.0,11.0,6.0
2,3.0,15.0,,
3,7.0,,,9.0
4,,5.0,9.0,
5,12.0,8.0,12.0,11.0


**.isnull()**

In [134]:
df.isnull() # To get a dataframe of boolean values. isna() gives the same result

Unnamed: 0,c0,c1,c2,c3
0,False,False,False,False
1,False,False,False,False
2,False,False,True,True
3,False,True,True,False
4,True,False,False,True
5,False,False,False,False


In [135]:
df.isnull().sum(axis=0) # count the missing values in each column

c0    1
c1    1
c2    2
c3    2
dtype: int64

**.notnull()**

In [136]:
df.notnull() # just the reverse of isnull

Unnamed: 0,c0,c1,c2,c3
0,True,True,True,True
1,True,True,True,True
2,True,True,False,False
3,True,False,False,True
4,False,True,True,False
5,True,True,True,True


**.dropna()**

In [137]:
df.dropna() # drop rows that contain any NaN. Defaults: axis=0 and inplace=False, so df itself is unchanged

Unnamed: 0,c0,c1,c2,c3
0,1.0,4.0,10.0,3.0
1,2.0,12.0,11.0,6.0
5,12.0,8.0,12.0,11.0


In [138]:
df

Unnamed: 0,c0,c1,c2,c3
0,1.0,4.0,10.0,3.0
1,2.0,12.0,11.0,6.0
2,3.0,15.0,,
3,7.0,,,9.0
4,,5.0,9.0,
5,12.0,8.0,12.0,11.0


In [139]:
df.dropna(axis=1) # drop columns that contain any NaN; every column here has at least one, so only the index remains

0
1
2
3
4
5


In [140]:
df

Unnamed: 0,c0,c1,c2,c3
0,1.0,4.0,10.0,3.0
1,2.0,12.0,11.0,6.0
2,3.0,15.0,,
3,7.0,,,9.0
4,,5.0,9.0,
5,12.0,8.0,12.0,11.0


In [141]:
df.dropna(axis=0, thresh=3) # thresh counts non-NA values: keep rows that have at least 3 of them

Unnamed: 0,c0,c1,c2,c3
0,1.0,4.0,10.0,3.0
1,2.0,12.0,11.0,6.0
5,12.0,8.0,12.0,11.0


In [142]:
df.dropna(axis=1, thresh=5) # look at all columns and keep those with at least 5 non-missing values

Unnamed: 0,c0,c1
0,1.0,4.0
1,2.0,12.0
2,3.0,15.0
3,7.0,
4,,5.0
5,12.0,8.0


In [143]:
df

Unnamed: 0,c0,c1,c2,c3
0,1.0,4.0,10.0,3.0
1,2.0,12.0,11.0,6.0
2,3.0,15.0,,
3,7.0,,,9.0
4,,5.0,9.0,
5,12.0,8.0,12.0,11.0
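
Besides `axis` and `thresh`, `dropna()` accepts `how` and `subset`, which give finer control over what counts as a row worth dropping. A minimal sketch, using a small hypothetical frame rather than the `df` above:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely missing, row 2 is missing only in 'b'
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, np.nan, np.nan]})

# how='all' drops a row only when every value in it is missing
drop_all = frame.dropna(how="all")

# subset restricts the check to the listed columns
drop_subset = frame.dropna(subset=["b"])

print(drop_all)     # rows 0 and 2 remain
print(drop_subset)  # only row 0 remains
```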


**fillna()**

In [144]:
df.fillna(100) # fill missing values with a constant

Unnamed: 0,c0,c1,c2,c3
0,1.0,4.0,10.0,3.0
1,2.0,12.0,11.0,6.0
2,3.0,15.0,100.0,100.0
3,7.0,100.0,100.0,9.0
4,100.0,5.0,9.0,100.0
5,12.0,8.0,12.0,11.0


In [145]:
df

Unnamed: 0,c0,c1,c2,c3
0,1.0,4.0,10.0,3.0
1,2.0,12.0,11.0,6.0
2,3.0,15.0,,
3,7.0,,,9.0
4,,5.0,9.0,
5,12.0,8.0,12.0,11.0


In [146]:
df.fillna(df.mean(axis=0)) # fill na with mean values in each column. default inplace=False

Unnamed: 0,c0,c1,c2,c3
0,1.0,4.0,10.0,3.0
1,2.0,12.0,11.0,6.0
2,3.0,15.0,10.5,7.25
3,7.0,8.8,10.5,9.0
4,5.0,5.0,9.0,7.25
5,12.0,8.0,12.0,11.0


In [147]:
df

Unnamed: 0,c0,c1,c2,c3
0,1.0,4.0,10.0,3.0
1,2.0,12.0,11.0,6.0
2,3.0,15.0,,
3,7.0,,,9.0
4,,5.0,9.0,
5,12.0,8.0,12.0,11.0


In [148]:
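df.fillna(method='ffill') # forward fill: copy the last valid value down each column; 'bfill' fills backward.
# newer pandas deprecates the method argument in favor of df.ffill() and df.bfill()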

Unnamed: 0,c0,c1,c2,c3
0,1.0,4.0,10.0,3.0
1,2.0,12.0,11.0,6.0
2,3.0,15.0,11.0,6.0
3,7.0,15.0,11.0,9.0
4,7.0,5.0,9.0,9.0
5,12.0,8.0,12.0,11.0
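
Two further fill patterns are worth knowing: a different constant per column, and backward fill. The sketch below uses a small hypothetical frame; the fill values 0.0 and -1.0 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in both columns
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, np.nan]})

# A dict maps each column name to its own fill value
per_column = frame.fillna({"a": 0.0, "b": -1.0})

# bfill copies the next valid value upward; a trailing NaN has nothing to copy from
back_filled = frame.bfill()

print(per_column)
print(back_filled)
```

Note that the last value of column `b` stays NaN after `bfill()`, because no valid value follows it.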


## 2.6 Numerical operations on Data Frames

In [149]:
df1 = pd.DataFrame(np.random.randint(0,10,(3,4)), index = 'r0 r1 r2'.split(), columns='c0 c1 c2 c3'.split())
df2 = pd.DataFrame(np.random.randint(0,10,(3,4)), index = 'r0 r1 r2'.split(), columns='c0 c1 c2 c3'.split())
print (df1, '\n')
print (df2)

    c0  c1  c2  c3
r0   5   6   2   5
r1   2   0   2   6
r2   5   6   6   8 

    c0  c1  c2  c3
r0   2   0   0   3
r1   9   2   7   4
r2   7   2   4   3


In [150]:
df1 + 8 # broadcasting operations similar to numpy arrays.

Unnamed: 0,c0,c1,c2,c3
r0,13,14,10,13
r1,10,8,10,14
r2,13,14,14,16


In [151]:
df1

Unnamed: 0,c0,c1,c2,c3
r0,5,6,2,5
r1,2,0,2,6
r2,5,6,6,8


In [152]:
df1 + df2

Unnamed: 0,c0,c1,c2,c3
r0,7,6,2,8
r1,11,2,9,10
r2,12,8,10,11
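
The sum above lines up cleanly because both frames carry identical row and column labels. When the labels only partly overlap, pandas aligns on labels first and leaves NaN wherever one side has no partner. A minimal sketch with hypothetical frames `x` and `y`:

```python
import pandas as pd

# Hypothetical frames that share only the row label 'r1'
x = pd.DataFrame({"c0": [1, 2], "c1": [3, 4]}, index=["r0", "r1"])
y = pd.DataFrame({"c0": [10, 20], "c1": [30, 40]}, index=["r1", "r2"])

# The + operator aligns on labels; positions with no partner become NaN
plain_sum = x + y

# add() with fill_value treats a missing partner as 0 instead
filled_sum = x.add(y, fill_value=0)

print(plain_sum)
print(filled_sum)
```

In `plain_sum` only row `r1` holds numbers (12 and 34); `filled_sum` keeps the unmatched rows `r0` and `r2` with their original values.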


**Any NumPy ufunc will work on pandas Series and DataFrames**

In [153]:
np.exp(df1) # we can also apply Numpy Ufuncs directly on Pandas series or data frames.

Unnamed: 0,c0,c1,c2,c3
r0,148.413159,403.428793,7.389056,148.413159
r1,7.389056,1.0,7.389056,403.428793
r2,148.413159,403.428793,403.428793,2980.957987
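
A useful property of applying a ufunc to a pandas object is that the result is still a pandas object with its labels intact. A small sketch, assuming a hypothetical Series `s`:

```python
import numpy as np
import pandas as pd

# Hypothetical labelled Series
s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# The ufunc is applied element by element; the index is carried over to the result
roots = np.sqrt(s)

print(type(roots).__name__)
print(roots)
```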


## Reading and Writing

In [None]:
# pandas provides many pd.read_* functions for reading and df.to_* methods for writing
# e.g. pd.read_csv, pd.read_excel, df.to_csv

In [154]:
boston.to_csv('boston_copy.csv')
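
By default `to_csv` writes the row index as an unnamed first column, so reading the file back needs `index_col=0` to restore the labels. The sketch below shows a round trip on a small hypothetical frame; it writes to an in-memory buffer so that no file is created, but a file path works the same way.

```python
import io
import pandas as pd

# Hypothetical frame with labelled rows
frame = pd.DataFrame({"c0": [1, 2], "c1": [3.5, 4.5]}, index=["r0", "r1"])

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer)

# Rewind, then read back; index_col=0 restores the row labels
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```

Passing `index=False` to `to_csv` skips the index column when the row labels carry no information.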