## Introduction to Data Science

### Introduction to Numpy and Pandas

In [1]:
#import pylab
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os
import pathlib


%matplotlib inline
#%matplotlib notebook

## [Numpy Basics](https://www.datacamp.com/community/tutorials/python-numpy-tutorial)

What Is A Python Numpy Array?  

NumPy arrays are a bit like Python lists, but they differ in important ways. If you are new to the topic, let’s clarify what an array is and what it’s good for.   

As the name suggests, the NumPy array is the central data structure of the numpy library. The library’s name is short for “Numeric Python” or “Numerical Python”.  

In other words, NumPy is the core library for scientific computing in Python. It contains a collection of tools and techniques for solving mathematical models of problems in science and engineering on a computer. One of these tools is a high-performance multidimensional array object, a powerful data structure for efficient computation on arrays and matrices. To work with these arrays, NumPy provides a large collection of high-level mathematical functions that operate on them.  

Then, what is an array?  

When you print a couple of arrays, you can see each one as a grid that contains values of the same type. An array holds and represents any regular data in a structured way.  

However, you should know that, on a structural level, an array is basically nothing but pointers. It’s a combination of a memory address, a data type, a shape and strides:  

The data pointer indicates the memory address of the first byte in the array,  
The data type or dtype pointer describes the kind of elements that are contained within the array,  
The shape gives the size of the array along each dimension, and  
The strides are the number of bytes that should be skipped in memory to go to the next element along each dimension.   

If your strides are (10,1), you need to proceed one byte to get to the next column and 10 bytes to locate the next row.  
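The relationship between element size, shape and strides can be checked directly. The following is a minimal sketch; the array names `grid` and `small` are made up for illustration, and the dtypes are set explicitly so the byte counts do not depend on the platform.

```python
import numpy as np

# A 3x4 array of 8-byte integers stored row by row (C order).
grid = np.arange(12, dtype=np.int64).reshape(3, 4)
print(grid.strides)    # (32, 8): 4 elements * 8 bytes to the next row, 8 bytes to the next column

# With 1-byte elements and 10 columns, the strides are (10, 1),
# matching the example in the text.
small = np.zeros((3, 10), dtype=np.int8)
print(small.strides)   # (10, 1)

# Transposing swaps the strides; the underlying bytes are not moved.
print(grid.T.strides)  # (8, 32)
```

Because a transpose only changes the strides, it is a cheap operation even for large arrays.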

### Creating Arrays:

In [2]:
my_numbers = [1,2,3,4]
simple_array = np.array(my_numbers)
print(simple_array)

[1 2 3 4]


In [3]:
type(simple_array)

numpy.ndarray

In [4]:
simple_array + 34

array([35, 36, 37, 38])

In [5]:
simple_array.shape

(4,)

In [6]:
simple_array.dtype

dtype('int64')

In [7]:
simple_array.data

<memory at 0x7fac8eeb3048>

In [8]:
simple_array.strides

(8,)

In [9]:
my_other_numbers = [[1,2,3],[4,5,6],[7,8,9]]
other_simple_array = np.array(my_other_numbers)
other_simple_array

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [10]:
other_simple_array.shape

(3, 3)

In [11]:
#a = np.arange(20)
a = np.arange(1,5,0.2)
print(a)

[1.  1.2 1.4 1.6 1.8 2.  2.2 2.4 2.6 2.8 3.  3.2 3.4 3.6 3.8 4.  4.2 4.4
 4.6 4.8]


In [12]:
b = np.linspace(1,10,30)
#b = np.linspace(1,2*np.pi,50)
print(b)

[ 1.          1.31034483  1.62068966  1.93103448  2.24137931  2.55172414
  2.86206897  3.17241379  3.48275862  3.79310345  4.10344828  4.4137931
  4.72413793  5.03448276  5.34482759  5.65517241  5.96551724  6.27586207
  6.5862069   6.89655172  7.20689655  7.51724138  7.82758621  8.13793103
  8.44827586  8.75862069  9.06896552  9.37931034  9.68965517 10.        ]


In [13]:
b2 = np.logspace(1,100,30)
print(b2)

[1.00000000e+001 2.59294380e+004 6.72335754e+007 1.74332882e+011
 4.52035366e+014 1.17210230e+018 3.03919538e+021 7.88046282e+024
 2.04335972e+028 5.29831691e+031 1.37382380e+035 3.56224789e+038
 9.23670857e+041 2.39502662e+045 6.21016942e+048 1.61026203e+052
 4.17531894e+055 1.08263673e+059 2.80721620e+062 7.27895384e+065
 1.88739182e+069 4.89390092e+072 1.26896100e+076 3.29034456e+079
 8.53167852e+082 2.21221629e+086 5.73615251e+089 1.48735211e+093
 3.85662042e+096 1.00000000e+100]


In [14]:
a1 = np.zeros((3,4))
print(a1)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


In [15]:
a2 = np.ones((2,2))
print(a2)

[[1. 1.]
 [1. 1.]]


In [16]:
a3 = np.empty((2,3))
print(a3)

[[2.77845286e-316 0.00000000e+000 0.00000000e+000]
 [0.00000000e+000 0.00000000e+000 0.00000000e+000]]


In [17]:
a4 = np.identity(3)
print(a4)

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


In [18]:
a5 = np.eye(2)        
print(a5)

[[1. 0.]
 [0. 1.]]


In [19]:
a6 = np.full((4,2), 7)
print(a6)

[[7 7]
 [7 7]
 [7 7]
 [7 7]]


In [20]:
a7 = np.random.random((2,2))  
print(a7)

[[0.31791486 0.35186112]
 [0.42719564 0.35669301]]


#### Modifying Dimensions:

In [21]:
c = np.arange(100)
print(c)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]


In [22]:
print(c.shape)
print(np.ndim(c))
print(c.dtype.name)

(100,)
1
int64


In [23]:
d = c.reshape(4,5,5)
print(d)

[[[ 0  1  2  3  4]
  [ 5  6  7  8  9]
  [10 11 12 13 14]
  [15 16 17 18 19]
  [20 21 22 23 24]]

 [[25 26 27 28 29]
  [30 31 32 33 34]
  [35 36 37 38 39]
  [40 41 42 43 44]
  [45 46 47 48 49]]

 [[50 51 52 53 54]
  [55 56 57 58 59]
  [60 61 62 63 64]
  [65 66 67 68 69]
  [70 71 72 73 74]]

 [[75 76 77 78 79]
  [80 81 82 83 84]
  [85 86 87 88 89]
  [90 91 92 93 94]
  [95 96 97 98 99]]]


In [24]:
print(d.shape)
print(np.ndim(d))
print(d.dtype.name)

(4, 5, 5)
3
int64


In [25]:
d = np.random.random(100).reshape(4,25)
print(d)

[[0.45068489 0.19284814 0.11074153 0.10252809 0.56722408 0.31763003
  0.79485104 0.06202026 0.51534286 0.39155939 0.9815717  0.55434952
  0.5795133  0.96223316 0.21459089 0.10270266 0.53455622 0.8893288
  0.447744   0.9945785  0.34029828 0.09180178 0.27734127 0.45185627
  0.20283199]
 [0.7055252  0.67150457 0.09390324 0.45175973 0.48256551 0.98061543
  0.22287316 0.76370071 0.19403511 0.10089044 0.94035812 0.48170316
  0.08914709 0.84605763 0.65990758 0.47403601 0.33174413 0.20842301
  0.76250569 0.67148728 0.6143641  0.5230728  0.76480379 0.45957459
  0.39412321]
 [0.72258069 0.27347625 0.23556194 0.38706457 0.01951955 0.78415685
  0.35352503 0.83268744 0.5269175  0.4460447  0.61093277 0.06722146
  0.73316103 0.2524642  0.37405828 0.25482602 0.95280915 0.22957499
  0.2377016  0.1964734  0.6978898  0.15285068 0.85293263 0.63596463
  0.04338911]
 [0.65078514 0.13681655 0.00145312 0.21803232 0.14609543 0.94038325
  0.59028884 0.2979303  0.94634664 0.25897919 0.02568256 0.88409901
  0.332

In [26]:
print(d.shape)
print(np.ndim(d))
print(d.dtype.name)

(4, 25)
2
float64


In [27]:
x = np.arange(12).reshape((3,4))
print(x)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


In [28]:
print(x.shape)
print(np.ndim(x))
print(x.dtype.name)

(3, 4)
2
int64


In [29]:
x.ravel()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [30]:
print(x.ravel().shape)
print(np.ndim(x.ravel()))
print(x.ravel().dtype.name)

(12,)
1
int64


In [31]:
# Resize `x` to (7,5); np.resize repeats the data of `x` to fill the new shape
y = np.resize(x, (7,5))
print(x, '\n\n\n', y)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]] 


 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11  0  1  2]
 [ 3  4  5  6  7]
 [ 8  9 10 11  0]
 [ 1  2  3  4  5]
 [ 6  7  8  9 10]]


### Slicing multidimensional arrays

In [32]:
d = np.arange(40).reshape(4,2,5)
print(d)

[[[ 0  1  2  3  4]
  [ 5  6  7  8  9]]

 [[10 11 12 13 14]
  [15 16 17 18 19]]

 [[20 21 22 23 24]
  [25 26 27 28 29]]

 [[30 31 32 33 34]
  [35 36 37 38 39]]]


In [33]:
d[1,:,3:]

array([[13, 14],
       [18, 19]])

In [34]:
d.shape

(4, 2, 5)

In [35]:
d[d%2==0]

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32,
       34, 36, 38])

In [36]:
d[~(d%2==0)]  # negation of condition; parentheses are needed because ~ binds tighter than % and ==

array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33,
       35, 37, 39])
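Negating a boolean mask is sensitive to operator precedence. The sketch below rebuilds the same `d` and compares three ways of selecting the odd values; the names `is_even`, `odds`, `same` and `unparenthesized` are illustrative only.

```python
import numpy as np

d = np.arange(40).reshape(4, 2, 5)
is_even = (d % 2 == 0)

odds = d[~is_even]        # invert the boolean mask itself
same = d[d % 2 != 0]      # or state the opposite condition directly

# Without parentheses, `~d % 2 == 0` is evaluated as ((~d) % 2) == 0.
# Here ~ is a bitwise NOT on the integers (~d == -d - 1), which flips
# their parity, so the result happens to match for this condition only.
unparenthesized = d[~d % 2 == 0]

print(odds[:5])
print(np.array_equal(odds, unparenthesized))
```

For other conditions, such as `d > 13`, the unparenthesized form would not be a negation, so wrapping the condition in parentheses before applying `~` is the safer habit.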

In [37]:
# Create a new array from which we will select elements
a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
print(a)

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]


In [38]:
# Create an array of indices
b = np.array([0, 2, 0, 1])
c = np.arange(4)

print(b, '\n\n', c)

[0 2 0 1] 

 [0 1 2 3]


In [39]:
# Select one element from each row of a using the indices in b
print(a[c, b])

[ 1  6  7 11]


In [40]:
# Mutate one element from each row of a using the indices in b
a[c, b] += 10
print(a)

[[11  2  3]
 [ 4  5 16]
 [17  8  9]
 [10 21 12]]


In [41]:
d

array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9]],

       [[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]],

       [[20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29]],

       [[30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39]]])

In [42]:
# Boolean array indexing
bool_idx = (d > 13)
print(bool_idx)

[[[False False False False False]
  [False False False False False]]

 [[False False False False  True]
  [ True  True  True  True  True]]

 [[ True  True  True  True  True]
  [ True  True  True  True  True]]

 [[ True  True  True  True  True]
  [ True  True  True  True  True]]]


In [43]:
print(d[bool_idx])

[14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
 38 39]


### Datatypes

In [44]:
x = np.array([1, 2])   # Let numpy choose the datatype
print(x.dtype)         # Prints "int64"

int64


In [45]:
x = np.array([1.0, 2.0])   # Let numpy choose the datatype
print(x.dtype)             # Prints "float64"

float64


In [46]:
x = np.array([1, 2], dtype=np.float64)   # Force a particular datatype
print(x.dtype)                         # Prints "float64"

float64


### Array Math

### Inline and vectorized operations:

In [47]:
a

array([[11,  2,  3],
       [ 4,  5, 16],
       [17,  8,  9],
       [10, 21, 12]])

In [48]:
a * 2

array([[22,  4,  6],
       [ 8, 10, 32],
       [34, 16, 18],
       [20, 42, 24]])

In [49]:
# the original array stays the same
a

array([[11,  2,  3],
       [ 4,  5, 16],
       [17,  8,  9],
       [10, 21, 12]])

In [50]:
a.cumsum()

array([ 11,  13,  16,  20,  25,  41,  58,  66,  75,  85, 106, 118])

In [51]:
a = np.arange(16).reshape(4,4)
np.vstack([a,np.arange(4).reshape(1,4)])

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [ 0,  1,  2,  3]])

In [52]:
np.hstack([a,np.arange(4).reshape(4,1)])

array([[ 0,  1,  2,  3,  0],
       [ 4,  5,  6,  7,  1],
       [ 8,  9, 10, 11,  2],
       [12, 13, 14, 15,  3]])

In [53]:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum; both produce the array
# [[ 6.0  8.0]
#  [10.0 12.0]]
print(x + y)
print(np.add(x, y))

[[ 6.  8.]
 [10. 12.]]
[[ 6.  8.]
 [10. 12.]]


In [54]:
# Elementwise difference; both produce the array
# [[-4.0 -4.0]
#  [-4.0 -4.0]]
print(x - y)
print(np.subtract(x, y))

[[-4. -4.]
 [-4. -4.]]
[[-4. -4.]
 [-4. -4.]]


In [55]:
# Elementwise product; both produce the array
# [[ 5.0 12.0]
#  [21.0 32.0]]
print(x * y)
print(np.multiply(x, y))

[[ 5. 12.]
 [21. 32.]]
[[ 5. 12.]
 [21. 32.]]


In [56]:
# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print(x / y)
print(np.divide(x, y))

[[0.2        0.33333333]
 [0.42857143 0.5       ]]
[[0.2        0.33333333]
 [0.42857143 0.5       ]]


In [57]:
# Elementwise square root; produces the array
# [[ 1.          1.41421356]
#  [ 1.73205081  2.        ]]
print(np.sqrt(x))

[[1.         1.41421356]
 [1.73205081 2.        ]]


In [58]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])
v = np.array([9,10])
w = np.array([11, 12])

In [59]:
print(x)
print()
print(y)
print()
print(v)
print()
print(w)

[[1 2]
 [3 4]]

[[5 6]
 [7 8]]

[ 9 10]

[11 12]


In [60]:
# Inner product of vectors; both produce 219
print(v.dot(w), '\n')
print(np.dot(v, w))

219 

219


In [61]:
# Matrix / vector product; both produce the rank 1 array [29 67]
print(x.dot(v), '\n')
print(np.dot(x, v))

[29 67] 

[29 67]


In [62]:
# Matrix / matrix product; both produce the rank 2 array
print(x.dot(y), '\n')
print(np.dot(x, y))

[[19 22]
 [43 50]] 

[[19 22]
 [43 50]]


In [63]:
x = np.array([[1,2],[3,4]])
print(x, '\n')
print(np.sum(x), '\n')  # Compute sum of all elements; prints "10"
print(np.sum(x, axis=0), '\n')  # Compute sum of each column; prints "[4 6]"
print(np.sum(x, axis=1))  # Compute sum of each row; prints "[3 7]"

[[1 2]
 [3 4]] 

10 

[4 6] 

[3 7]


In [64]:
x = np.array([[1,2], [3,4]])

In [65]:
print(x, '\n')
print(x.T)

[[1 2]
 [3 4]] 

[[1 3]
 [2 4]]


In [66]:
# Note that taking the transpose of a rank 1 array does nothing:
v = np.array([1,2,3])

In [67]:
print(v, '\n')
print(v.T)

[1 2 3] 

[1 2 3]
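If a row or column vector is actually needed, the rank 1 array first has to be given a second axis. This is a small sketch using `np.newaxis`; the names `row` and `col` are chosen here for illustration.

```python
import numpy as np

v = np.array([1, 2, 3])
print(v.T.shape)          # (3,): transposing a 1-D array changes nothing

row = v[np.newaxis, :]    # add a leading axis  -> shape (1, 3)
col = v[:, np.newaxis]    # add a trailing axis -> shape (3, 1)

print(row.shape, col.shape)
print(row.T.shape)        # (3, 1): now the transpose has an effect
```

`v.reshape(1, -1)` and `v.reshape(-1, 1)` would produce the same two shapes.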


In [68]:
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = np.empty_like(x)   # Create an empty matrix with the same shape as x

In [69]:
print(x, '\n')
print(v, '\n')
print(y)

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]] 

[1 0 1] 

[[140380469214192               0               0]
 [              0               0               0]
 [              0               0               0]
 [              0               0               0]]


In [70]:
for i in range(4):
    y[i, :] = x[i, :] + v
print(y)

[[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]
 [11 11 13]]


This works; however, when the matrix x is very large, running an explicit loop in Python can be slow. Note that adding the vector v to each row of the matrix x is equivalent to forming a matrix vv by stacking multiple copies of v vertically, then performing elementwise summation of x and vv. We could implement this approach like this:

In [71]:
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
vv = np.tile(v, (4, 1))   # Stack 4 copies of v on top of each other
print(vv)

[[1 0 1]
 [1 0 1]
 [1 0 1]
 [1 0 1]]


In [72]:
y = x + vv  # Add x and vv elementwise
print(y)

[[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]
 [11 11 13]]


Numpy broadcasting allows us to perform this computation without actually creating multiple copies of v. Consider this version, using broadcasting:

In [73]:
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = x + v  # Add v to each row of x using broadcasting
print(y)

[[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]
 [11 11 13]]


The line y = x + v works even though x has shape (4, 3) and v has shape (3,) due to broadcasting; this line works as if v actually had shape (4, 3), where each row was a copy of v, and the sum was performed elementwise.
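One way to see the "as if" in that statement is `np.broadcast_to`, which returns the stretched view that broadcasting uses conceptually. The sketch below assumes the same `x` and `v` as above; `vv_view` is a name introduced here.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
v = np.array([1, 0, 1])

# A read-only view of v with shape (4, 3); no data is copied.
vv_view = np.broadcast_to(v, x.shape)
print(vv_view.shape)      # (4, 3)

# The stride along the row axis is 0, so every "row" points at the
# same bytes of v instead of at a separate copy.
print(vv_view.strides)

# The result matches the explicit np.tile version.
print(np.array_equal(x + v, x + np.tile(v, (4, 1))))
```

The zero stride is why broadcasting avoids the memory cost of `np.tile`, which really does allocate four copies of `v`.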

In [74]:
# Initialize `x` and `y`
x = np.ones((3,4))
y = np.random.random((5,1,4))

# Add `x` and `y`
print(x + y)

[[[1.64548702 1.33570947 1.90261238 1.08842856]
  [1.64548702 1.33570947 1.90261238 1.08842856]
  [1.64548702 1.33570947 1.90261238 1.08842856]]

 [[1.3307085  1.09386741 1.34957408 1.12363104]
  [1.3307085  1.09386741 1.34957408 1.12363104]
  [1.3307085  1.09386741 1.34957408 1.12363104]]

 [[1.59715733 1.90091328 1.46152321 1.8292081 ]
  [1.59715733 1.90091328 1.46152321 1.8292081 ]
  [1.59715733 1.90091328 1.46152321 1.8292081 ]]

 [[1.93802493 1.00210286 1.67563666 1.44231374]
  [1.93802493 1.00210286 1.67563666 1.44231374]
  [1.93802493 1.00210286 1.67563666 1.44231374]]

 [[1.90387431 1.05722786 1.65004156 1.98904379]
  [1.90387431 1.05722786 1.65004156 1.98904379]
  [1.90387431 1.05722786 1.65004156 1.98904379]]]


Even though x and y have different shapes, the two can be added together.  
That is because they are compatible in all dimensions:

    Array x has dimensions 3 X 4,
    Array y has dimensions 5 X 1 X 4

Shapes are compared from the last dimension backwards, and two dimensions are compatible if they are equal or if one of them is 1; a missing leading dimension is treated as 1. Here 4 matches 4, 3 pairs with 1, and the missing dimension of x pairs with 5, so these two arrays are indeed good candidates for broadcasting!  

Notice that in the dimension where y has size 1 and x has a size greater than 1 (that is, 3), y behaves as if it were copied along that dimension. Likewise, x behaves as if it were copied 5 times along the leading dimension it lacks.  

The shape of the resulting array is the maximum size along each dimension of x and y: here the result has shape (5,3,4).  

In short, if you want to make use of broadcasting, you will rely a lot on the shape and dimensions of the arrays with which you’re working.  
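The compatibility check can be written out by hand to make the rule concrete. The function `broadcast_shape` below is a hypothetical helper written for this illustration, not part of NumPy; its result is compared against `np.broadcast`, which NumPy does provide.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: apply the broadcasting rule to two shapes."""
    result = []
    # Walk the dimensions from last to first; a missing dimension counts as 1.
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        a = shape_a[-i] if i <= len(shape_a) else 1
        b = shape_b[-i] if i <= len(shape_b) else 1
        if a != b and a != 1 and b != 1:
            raise ValueError(f"incompatible dimensions: {a} and {b}")
        # If one side is 1, the other side determines the result size.
        result.append(b if a == 1 else a)
    return tuple(reversed(result))

print(broadcast_shape((3, 4), (5, 1, 4)))                   # (5, 3, 4)
print(np.broadcast(np.ones((3, 4)), np.ones((5, 1, 4))).shape)

# Shapes (3, 4) and (5, 2, 4) are not compatible: 3 and 2 differ and neither is 1.
try:
    broadcast_shape((3, 4), (5, 2, 4))
except ValueError as err:
    print(err)
```

In practice there is no need for such a helper; the sketch only spells out what NumPy checks before it adds the two arrays.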

# Pandas

### Pandas Data Structures: Series

In [75]:
obj = pd.Series([4, 7, -5, 3, 5])
obj

0    4
1    7
2   -5
3    3
4    5
dtype: int64

In [76]:
obj.values

array([ 4,  7, -5,  3,  5])

In [77]:
obj.index

RangeIndex(start=0, stop=5, step=1)

In [78]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan', 'Fernie']
obj

Bob       4
Steve     7
Jeff     -5
Ryan      3
Fernie    5
dtype: int64

In [79]:
obj['Bob']

4

In [80]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [81]:
obj2['c']

3

In [82]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64

In [83]:
obj2[obj2 < 0]

a   -5
dtype: int64

In [84]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

In [85]:
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [86]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [87]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [88]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [89]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [90]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

In [91]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

### Pandas Data Structures: Dataframe

In [92]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],'year': [2000, 2001, 2002, 2001, 2002],'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)
frame

Unnamed: 0,pop,state,year
0,1.5,Ohio,2000
1,1.7,Ohio,2001
2,3.6,Ohio,2002
3,2.4,Nevada,2001
4,2.9,Nevada,2002


In [93]:
frame['pop']

0    1.5
1    1.7
2    3.6
3    2.4
4    2.9
Name: pop, dtype: float64

In [94]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


In [95]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],index=['one', 'two', 'three', 'four', 'five'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


In [96]:
frame2['nova'] = 13
frame2

Unnamed: 0,year,state,pop,debt,nova
one,2000,Ohio,1.5,,13
two,2001,Ohio,1.7,,13
three,2002,Ohio,3.6,,13
four,2001,Nevada,2.4,,13
five,2002,Nevada,2.9,,13


In [97]:
frame2.nova = 23
frame2

Unnamed: 0,year,state,pop,debt,nova
one,2000,Ohio,1.5,,23
two,2001,Ohio,1.7,,23
three,2002,Ohio,3.6,,23
four,2001,Nevada,2.4,,23
five,2002,Nevada,2.9,,23


In [98]:
frame2.columns

Index(['year', 'state', 'pop', 'debt', 'nova'], dtype='object')

In [99]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [100]:
frame2.state

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [101]:
#frame2.loc['three']
frame2.loc['three','state']

'Ohio'

In [102]:
frame2['debt'] = 16.5
frame2

Unnamed: 0,year,state,pop,debt,nova
one,2000,Ohio,1.5,16.5,23
two,2001,Ohio,1.7,16.5,23
three,2002,Ohio,3.6,16.5,23
four,2001,Nevada,2.4,16.5,23
five,2002,Nevada,2.9,16.5,23


In [103]:
frame2['debt'] = np.arange(5.)
frame2

Unnamed: 0,year,state,pop,debt,nova
one,2000,Ohio,1.5,0.0,23
two,2001,Ohio,1.7,1.0,23
three,2002,Ohio,3.6,2.0,23
four,2001,Nevada,2.4,3.0,23
five,2002,Nevada,2.9,4.0,23


In [104]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

Unnamed: 0,year,state,pop,debt,nova
one,2000,Ohio,1.5,,23
two,2001,Ohio,1.7,-1.2,23
three,2002,Ohio,3.6,,23
four,2001,Nevada,2.4,-1.5,23
five,2002,Nevada,2.9,-1.7,23


In [105]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,nova,eastern
one,2000,Ohio,1.5,,23,True
two,2001,Ohio,1.7,-1.2,23,True
three,2002,Ohio,3.6,,23,True
four,2001,Nevada,2.4,-1.5,23,False
five,2002,Nevada,2.9,-1.7,23,False


In [106]:
del frame2['eastern']
frame2.columns

Index(['year', 'state', 'pop', 'debt', 'nova'], dtype='object')

In [107]:
transpose = frame2.pivot(index= 'year', columns='state', values='pop') 
transpose

state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6


In [108]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [109]:
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


In [110]:
pd.DataFrame(pop, index=[2001, 2002, 2003])

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,


In [111]:
pdata = {'Ohio': frame3['Ohio'][:-1],'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7


In [112]:
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6


In [113]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame4 = pd.DataFrame(pop)
frame4

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [114]:
frame4.loc[2000,'Nevada'] = 2
frame4

Unnamed: 0,Nevada,Ohio
2000,2.0,1.5
2001,2.4,1.7
2002,2.9,3.6


In [115]:
frame5 = pd.concat([frame4, frame4])
frame5

Unnamed: 0,Nevada,Ohio
2000,2.0,1.5
2001,2.4,1.7
2002,2.9,3.6
2000,2.0,1.5
2001,2.4,1.7
2002,2.9,3.6


In [116]:
frame5.drop_duplicates(['Nevada'])

Unnamed: 0,Nevada,Ohio
2000,2.0,1.5
2001,2.4,1.7
2002,2.9,3.6


In [117]:
dates = pd.date_range("20160101", periods=10)
data = np.random.random((10,3))
column_names = ['Column1', 'Column2', 'Column3']
df = pd.DataFrame(data, index=dates, columns=column_names)
df.head(10)

Unnamed: 0,Column1,Column2,Column3
2016-01-01,0.617296,0.822074,0.588782
2016-01-02,0.470787,0.987033,0.021716
2016-01-03,0.735567,0.273703,0.461745
2016-01-04,0.004981,0.999473,0.495767
2016-01-05,0.233576,0.539394,0.140252
2016-01-06,0.540808,0.618333,0.414181
2016-01-07,0.516582,0.100819,0.333751
2016-01-08,0.857943,0.814915,0.080223
2016-01-09,0.651855,0.705682,0.634107
2016-01-10,0.746688,0.32706,0.210018


In [118]:
df[1:3]

Unnamed: 0,Column1,Column2,Column3
2016-01-02,0.470787,0.987033,0.021716
2016-01-03,0.735567,0.273703,0.461745


In [119]:
df['20160104':'20160107']

Unnamed: 0,Column1,Column2,Column3
2016-01-04,0.004981,0.999473,0.495767
2016-01-05,0.233576,0.539394,0.140252
2016-01-06,0.540808,0.618333,0.414181
2016-01-07,0.516582,0.100819,0.333751


In [120]:
df.loc['20160101':'20160102',['Column1','Column3']]

Unnamed: 0,Column1,Column3
2016-01-01,0.617296,0.588782
2016-01-02,0.470787,0.021716


In [121]:
df.iloc[3:5, 0:2]

Unnamed: 0,Column1,Column2
2016-01-04,0.004981,0.999473
2016-01-05,0.233576,0.539394


In [122]:
df.describe()

Unnamed: 0,Column1,Column2,Column3
count,10.0,10.0,10.0
mean,0.537608,0.618849,0.338054
std,0.255161,0.306712,0.215655
min,0.004981,0.100819,0.021716
25%,0.482235,0.380144,0.157694
50%,0.579052,0.662008,0.373966
75%,0.714639,0.820284,0.487262
max,0.857943,0.999473,0.634107


In [123]:
df.sort_index(axis=0, ascending=False)  # returns a sorted copy; pass inplace=True to modify df itself

Unnamed: 0,Column1,Column2,Column3
2016-01-10,0.746688,0.32706,0.210018
2016-01-09,0.651855,0.705682,0.634107
2016-01-08,0.857943,0.814915,0.080223
2016-01-07,0.516582,0.100819,0.333751
2016-01-06,0.540808,0.618333,0.414181
2016-01-05,0.233576,0.539394,0.140252
2016-01-04,0.004981,0.999473,0.495767
2016-01-03,0.735567,0.273703,0.461745
2016-01-02,0.470787,0.987033,0.021716
2016-01-01,0.617296,0.822074,0.588782


In [124]:
df.sort_values(by='Column2')

Unnamed: 0,Column1,Column2,Column3
2016-01-07,0.516582,0.100819,0.333751
2016-01-03,0.735567,0.273703,0.461745
2016-01-10,0.746688,0.32706,0.210018
2016-01-05,0.233576,0.539394,0.140252
2016-01-06,0.540808,0.618333,0.414181
2016-01-09,0.651855,0.705682,0.634107
2016-01-08,0.857943,0.814915,0.080223
2016-01-01,0.617296,0.822074,0.588782
2016-01-02,0.470787,0.987033,0.021716
2016-01-04,0.004981,0.999473,0.495767


In [125]:
dates1 = pd.date_range("20160101", periods=6)
data1 = np.random.random((6,2))
column_names1 = ['ColumnA', 'ColumnB']

dates2 = pd.date_range("20160101", periods=7)
data2 = np.random.random((7,2))
column_names2 = ['ColumnC', 'ColumnD']

df1 = pd.DataFrame(data1, index=dates1, columns=column_names1)
df2 = pd.DataFrame(data2, index=dates2, columns=column_names2)

In [126]:
df1.head()

Unnamed: 0,ColumnA,ColumnB
2016-01-01,0.521787,0.938808
2016-01-02,0.884031,0.189103
2016-01-03,0.323498,0.711633
2016-01-04,0.760178,0.93445
2016-01-05,0.326683,0.285867


In [127]:
df2.head()

Unnamed: 0,ColumnC,ColumnD
2016-01-01,0.784568,0.992253
2016-01-02,0.809385,0.946755
2016-01-03,0.749843,0.053893
2016-01-04,0.408027,0.896876
2016-01-05,0.813036,0.469663


In [128]:
df1.join(df2)

Unnamed: 0,ColumnA,ColumnB,ColumnC,ColumnD
2016-01-01,0.521787,0.938808,0.784568,0.992253
2016-01-02,0.884031,0.189103,0.809385,0.946755
2016-01-03,0.323498,0.711633,0.749843,0.053893
2016-01-04,0.760178,0.93445,0.408027,0.896876
2016-01-05,0.326683,0.285867,0.813036,0.469663
2016-01-06,0.521432,0.168572,0.099519,0.000667


In [129]:
df3 = df1.join(df2)

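# add a column to df3 to group on; `dates` is the 10-day index defined earlier,
# so the assignment aligns on the index and keeps only the six labels present in df3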
df3['ProfitLoss'] = pd.Series(['Profit', 
                               'Loss', 
                               'Profit', 
                               'Profit', 
                               'Profit', 
                               'Loss', 
                               'Profit', 
                               'Profit', 
                               'Profit', 
                               'Loss'], index=dates)

In [130]:
df3.head()

Unnamed: 0,ColumnA,ColumnB,ColumnC,ColumnD,ProfitLoss
2016-01-01,0.521787,0.938808,0.784568,0.992253,Profit
2016-01-02,0.884031,0.189103,0.809385,0.946755,Loss
2016-01-03,0.323498,0.711633,0.749843,0.053893,Profit
2016-01-04,0.760178,0.93445,0.408027,0.896876,Profit
2016-01-05,0.326683,0.285867,0.813036,0.469663,Profit


In [131]:
df3.groupby('ProfitLoss').mean()

             ColumnA   ColumnB   ColumnC   ColumnD
ProfitLoss
Loss        0.702732  0.178837  0.454452  0.473711
Profit      0.483036  0.717689  0.688869  0.603171
