# FLIP (00): Data Science 
**(Module 01: Data Science)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use this package, but NOT allowed to change and distribute it.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au), Australia

---

## Session 2 Operations in Pandas

# Pandas

Credits: The following are notes taken while working through [Python for Data Analysis](http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1449319793) by Wes McKinney

* Series
* DataFrame
* Reindexing
* Dropping Entries
* Indexing, Selecting, Filtering
* Arithmetic and Data Alignment
* Function Application and Mapping
* Sorting and Ranking
* Axis Indices with Duplicate Values
* Summarizing and Computing Descriptive Statistics
* Cleaning Data (Under Construction)
* Input and Output (Under Construction)

In [102]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

## Series

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels.  The data can be any NumPy data type and the labels are the Series' index.

Create a Series:

In [103]:
ser_1 = Series([1,7,77,777,77777,1546])
ser_1

0        1
1        7
2       77
3      777
4    77777
5     1546
dtype: int64

Get the array representation of a Series:

In [104]:
ser_1.values

array([    1,     7,    77,   777, 77777,  1546])

Index objects are immutable and hold the axis labels and metadata such as names and axis names.

Get the index of the Series:

In [105]:
ser_1.index

RangeIndex(start=0, stop=6, step=1)
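A minimal sketch of what index immutability means in practice. The Series `ser_demo` is a hypothetical example, not part of the original notes: item assignment on an Index is rejected, while replacing the whole index by assignment is allowed.

```python
import pandas as pd

# Hypothetical Series used only for this illustration.
ser_demo = pd.Series([1, 7, 77], index=['x', 'y', 'z'])

try:
    ser_demo.index[0] = 'a'   # Index objects do not support item assignment
    mutated = True
except TypeError:
    mutated = False

# Replacing the whole index is allowed and installs a new Index object.
ser_demo.index = ['a', 'b', 'c']
print(mutated, list(ser_demo.index))
```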

Create a Series with a custom index:

In [106]:
ser_2 = Series([1, 1, 2, -3, -5], index=['a', 'b', 'c', 'd', 'e'])
ser_2

a    1
b    1
c    2
d   -3
e   -5
dtype: int64

Get a value from a Series by position (with `iloc`) or by label:

In [107]:
ser_2.iloc[3] == ser_2['d']


True

Get a set of values from a Series by passing in a list:

In [108]:
ser_2[['c', 'a', 'b']]

c    2
a    1
b    1
dtype: int64

Get values greater than 0:

In [109]:
ser_2[ser_2 > 0]

a    1
b    1
c    2
dtype: int64

Scalar multiply:

In [110]:
ser_2 * 2

a     2
b     2
c     4
d    -6
e   -10
dtype: int64

Apply a numpy math function:

In [111]:
import numpy as np
np.exp(ser_2)

a    2.718282
b    2.718282
c    7.389056
d    0.049787
e    0.006738
dtype: float64

A Series is like a fixed-length, ordered dict.  

Create a series by passing in a dict:

In [112]:
dict_1 = {'foo' : 100, 'bar' : 200, 'baz' : 300}
ser_3 = Series(dict_1)
ser_3

foo    100
bar    200
baz    300
dtype: int64

Re-order a Series by passing in an index (indices not found are NaN):

In [113]:
index = ['foo', 'bar', 'baz', 'qux']
ser_4 = Series(dict_1, index=index) 
ser_4

foo    100.0
bar    200.0
baz    300.0
qux      NaN
dtype: float64

Check for NaN with the pandas method:

In [114]:
pd.isnull(ser_4)

foo    False
bar    False
baz    False
qux     True
dtype: bool

Check for NaN with the Series method:

In [115]:
ser_4.isnull()

foo    False
bar    False
baz    False
qux     True
dtype: bool

Series automatically aligns differently indexed data in arithmetic operations:

In [116]:
ser_3 + ser_4

bar    400.0
baz    600.0
foo    200.0
qux      NaN
dtype: float64

Name a Series:

In [117]:
ser_4.name = 'foobarbazqux'

Name a Series index:

In [118]:
ser_4.index.name = 'label'

In [119]:
ser_4

label
foo    100.0
bar    200.0
baz    300.0
qux      NaN
Name: foobarbazqux, dtype: float64

Rename a Series' index in place:

In [120]:
ser_4.index = ['fo', 'br', 'bz', 'qx']
ser_4

fo    100.0
br    200.0
bz    300.0
qx      NaN
Name: foobarbazqux, dtype: float64

## DataFrame

A DataFrame is a tabular data structure containing an ordered collection of columns.  Each column can have a different type.  A DataFrame has both row and column indices and is analogous to a dict of Series that share the same index.  Row and column operations are treated roughly symmetrically.  Depending on the pandas version and its copy-on-write setting, a column returned when indexing a DataFrame may be a view of the underlying data rather than a copy.  To obtain an independent copy, use the Series' copy method.
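A minimal sketch of the copy behaviour described above, using a hypothetical frame `df_demo` that is not part of the original notes. Whether an uncopied column shares memory with its parent varies between pandas versions, so the sketch only shows the part that should hold everywhere: an explicit copy is independent of the DataFrame.

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
df_demo = pd.DataFrame({'year': [2012, 2013], 'pop': [5.0, 5.1]})

# An explicit copy is independent of the DataFrame it came from.
pop_copy = df_demo['pop'].copy()
pop_copy.iloc[0] = -1.0

print(df_demo['pop'].tolist())   # the original column is unchanged
print(pop_copy.tolist())
```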

Create a DataFrame:

In [121]:
data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'pop' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = DataFrame(data_1)
df_1


Unnamed: 0,state,year,pop
0,VA,2012,5.0
1,VA,2013,5.1
2,VA,2014,5.2
3,MD,2014,4.0
4,MD,2015,4.1


Create a DataFrame specifying a sequence of columns:

In [122]:
df_2 = DataFrame(data_1, columns=['year', 'state', 'pop'])
df_2

Unnamed: 0,year,state,pop
0,2012,VA,5.0
1,2013,VA,5.1
2,2014,VA,5.2
3,2014,MD,4.0
4,2015,MD,4.1


Like Series, columns that are not present in the data are NaN:

In [123]:
df_3 = DataFrame(data_1, columns=['year', 'state', 'pop', 'unempl'])
df_3

Unnamed: 0,year,state,pop,unempl
0,2012,VA,5.0,
1,2013,VA,5.1,
2,2014,VA,5.2,
3,2014,MD,4.0,
4,2015,MD,4.1,


Retrieve a column by key, returning a Series:


In [124]:
df_3['state']

0    VA
1    VA
2    VA
3    MD
4    MD
Name: state, dtype: object

In [125]:
df_3['year']

0    2012
1    2013
2    2014
3    2014
4    2015
Name: year, dtype: int64

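Retrieve a column by attribute, returning a Series.  This works only when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute; `df_3.pop`, for example, returns the `DataFrame.pop` method rather than the `pop` column, so use `df_3['pop']` for that column:

In [126]:
df_3.year

0    2012
1    2013
2    2014
3    2014
4    2015
Name: year, dtype: int64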
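The difference between attribute access and key access can be sketched as follows. The frame `df_demo` is a hypothetical example; the point is that a column named like a DataFrame method (here `pop`) is only reachable by key.

```python
import pandas as pd

# Hypothetical frame with a column named like a DataFrame method.
df_demo = pd.DataFrame({'year': [2012, 2013], 'pop': [5.0, 5.1]})

by_attr = df_demo.year     # works: 'year' does not clash with a DataFrame attribute
by_key = df_demo['pop']    # key access works whatever the column is called

print(isinstance(by_attr, pd.Series))   # the 'year' column
print(callable(df_demo.pop))            # the DataFrame.pop method, not the column
print(isinstance(by_key, pd.Series))    # the 'pop' column
```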

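Retrieve a row by index label with `loc` (here the labels happen to equal the positions; use `iloc` for purely positional access):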

In [127]:
df_3.loc[0]

year      2012
state       VA
pop        5.0
unempl     NaN
Name: 0, dtype: object
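To see how label-based and position-based row access differ, here is a small sketch with a hypothetical frame `df_demo` whose index labels are deliberately not 0, 1, 2.

```python
import pandas as pd

# Hypothetical frame whose labels differ from its positions.
df_demo = pd.DataFrame({'state': ['VA', 'MD', 'NY']}, index=[10, 20, 30])

row_by_label = df_demo.loc[10]       # the row labelled 10
row_by_position = df_demo.iloc[2]    # the third row, whatever its label

print(row_by_label['state'], row_by_position['state'])
```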

Update a column by assignment:

In [128]:
df_3['unempl'] = np.arange(5)
df_3

Unnamed: 0,year,state,pop,unempl
0,2012,VA,5.0,0
1,2013,VA,5.1,1
2,2014,VA,5.2,2
3,2014,MD,4.0,3
4,2015,MD,4.1,4


Assign a Series to a column.  The Series is aligned on the DataFrame's index and labels it does not contain become NaN; a list or array, by contrast, must match the length of the DataFrame:

In [129]:
unempl = Series([6.0, 6.0, 6.1], index=[2, 3, 4])
df_3['unempl'] = unempl
df_3

Unnamed: 0,year,state,pop,unempl
0,2012,VA,5.0,
1,2013,VA,5.1,
2,2014,VA,5.2,6.0
3,2014,MD,4.0,6.0
4,2015,MD,4.1,6.1
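The contrast between assigning a Series and assigning a list can be sketched like this. The frame `df_demo` and the column name `bad` are hypothetical and serve only to trigger the length check.

```python
import pandas as pd

# Hypothetical three-row frame.
df_demo = pd.DataFrame({'year': [2012, 2013, 2014]})

# A Series is aligned on the index; labels it lacks become NaN.
df_demo['unempl'] = pd.Series([6.0, 6.1], index=[1, 2])

# A list carries no labels, so its length must equal the number of rows.
try:
    df_demo['bad'] = [1, 2]
    length_checked = False
except ValueError:
    length_checked = True

print(df_demo['unempl'].isna().tolist(), length_checked)
```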


Assigning to a column that doesn't exist creates a new column:

In [130]:
df_3['state_dup'] = df_3['state']
df_3

Unnamed: 0,year,state,pop,unempl,state_dup
0,2012,VA,5.0,,VA
1,2013,VA,5.1,,VA
2,2014,VA,5.2,6.0,VA
3,2014,MD,4.0,6.0,MD
4,2015,MD,4.1,6.1,MD


Delete a column:

In [131]:
del df_3['state_dup']
df_3

Unnamed: 0,year,state,pop,unempl
0,2012,VA,5.0,
1,2013,VA,5.1,
2,2014,VA,5.2,6.0
3,2014,MD,4.0,6.0
4,2015,MD,4.1,6.1


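Create a DataFrame from a nested dict of dicts (the keys in the inner dicts are unioned to form the index in the result, unless an explicit index is specified; as the second example below shows, recent pandas versions keep the keys in first-seen order rather than sorting them):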

In [132]:
pop = {'VA' : {2013 : 5.1, 2014 : 5.2},
       'MD' : {2014 : 4.0, 2015 : 4.1}}
df_4 = DataFrame(pop)
df_4

Unnamed: 0,VA,MD
2013,5.1,
2014,5.2,4.0
2015,,4.1


In [133]:
data = {'weight':{ 2014 : 100, 2015 : 94, 2016 : 102}, 
       'reading':{2013 : 10, 2015 : 6, 2016 : 4}
       }
df = DataFrame(data)
df

Unnamed: 0,weight,reading
2014,100.0,
2015,94.0,6.0
2016,102.0,4.0
2013,,10.0


Transpose the DataFrame:

In [134]:
df_4.T

Unnamed: 0,2013,2014,2015
VA,5.1,5.2,
MD,,4.0,4.1


Create a DataFrame from a dict of Series:

In [135]:
data_2 = {'VA' : df_4['VA'][1:],
          'MD' : df_4['MD'][2:]}
df_5 = DataFrame(data_2)
df_5

Unnamed: 0,VA,MD
2014,5.2,
2015,,4.1


Set the DataFrame index name:

In [136]:
df_5.index.name = 'year'
df_5

year,VA,MD
2014,5.2,
2015,,4.1


Set the DataFrame columns name:

In [137]:
df_5.columns.name = 'state'
df_5

state,VA,MD
year,,
2014,5.2,
2015,,4.1


Return the data contained in a DataFrame as a 2D ndarray:

In [138]:
df_5.values

array([[5.2, nan],
       [nan, 4.1]])

If the columns have different dtypes, the 2D ndarray's dtype will accommodate all of the columns:

In [139]:
df_3.values

array([[2012, 'VA', 5.0, nan],
       [2013, 'VA', 5.1, nan],
       [2014, 'VA', 5.2, 6.0],
       [2014, 'MD', 4.0, 6.0],
       [2015, 'MD', 4.1, 6.1]], dtype=object)
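A small sketch of how the resulting dtype is chosen, using a hypothetical frame `df_demo` and the `to_numpy()` method, which returns the same kind of array as `.values`: a string column forces `object`, while integer and float columns are upcast to `float64`.

```python
import pandas as pd

# Hypothetical frame with integer, string and float columns.
df_demo = pd.DataFrame({'year': [2012, 2013],
                        'state': ['VA', 'MD'],
                        'pop': [5.0, 5.1]})

mixed = df_demo.to_numpy()                      # strings force dtype=object
numeric = df_demo[['year', 'pop']].to_numpy()   # int + float are upcast to float64

print(mixed.dtype, numeric.dtype)
```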

## Reindexing

Create a new object with the data conformed to a new index.  Any missing values are set to NaN.

In [140]:
df_3

Unnamed: 0,year,state,pop,unempl
0,2012,VA,5.0,
1,2013,VA,5.1,
2,2014,VA,5.2,6.0
3,2014,MD,4.0,6.0
4,2015,MD,4.1,6.1


Reindexing rows returns a new frame with the specified index:

In [141]:
df_3.reindex(list(reversed(range(0, 6))))

Unnamed: 0,year,state,pop,unempl
5,,,,
4,2015.0,MD,4.1,6.1
3,2014.0,MD,4.0,6.0
2,2014.0,VA,5.2,6.0
1,2013.0,VA,5.1,
0,2012.0,VA,5.0,


In [143]:
df_3.reindex(reversed(range(0, 6)))

Unnamed: 0,year,state,pop,unempl
5,,,,
4,2015.0,MD,4.1,6.1
3,2014.0,MD,4.0,6.0
2,2014.0,VA,5.2,6.0
1,2013.0,VA,5.1,
0,2012.0,VA,5.0,


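Values for newly introduced labels can be set to something other than NaN (NaNs already present in the data are left untouched).  Note that a descending range needs a negative step; `range(6, 0)` is empty:

In [144]:
df_3.reindex(range(6, 0, -1), fill_value=0)

Unnamed: 0,year,state,pop,unempl
6,0,0,0.0,0.0
5,0,0,0.0,0.0
4,2015,MD,4.1,6.1
3,2014,MD,4.0,6.0
2,2014,VA,5.2,6.0
1,2013,VA,5.1,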


Fill gaps in ordered data such as a time series by propagating the previous (`ffill`) or the next (`bfill`) valid value:

In [145]:
ser_5 = Series(['foo', 'bar', 'baz'], index=[0, 2, 4])
ser_5

0    foo
2    bar
4    baz
dtype: object

In [146]:
ser_5.reindex(range(5), method='ffill')

0    foo
1    foo
2    bar
3    bar
4    baz
dtype: object

In [147]:
ser_5.reindex(range(5), method='bfill')

0    foo
1    bar
2    bar
3    baz
4    baz
dtype: object

Reindex columns:

In [148]:
df_3.reindex(columns=['state', 'pop', 'unempl', 'year'])

Unnamed: 0,state,pop,unempl,year
0,VA,5.0,,2012
1,VA,5.1,,2013
2,VA,5.2,6.0,2014
3,MD,4.0,6.0,2014
4,MD,4.1,6.1,2015


Reindex rows and columns while filling rows:

In [149]:
df_3.reindex(index=list(reversed(range(0, 6))),
             fill_value=0,
             columns=['state', 'pop', 'unempl', 'year'])

Unnamed: 0,state,pop,unempl,year
5,0,0.0,0.0,0
4,MD,4.1,6.1,2015
3,MD,4.0,6.0,2014
2,VA,5.2,6.0,2014
1,VA,5.1,,2013
0,VA,5.0,,2012


pandas has replaced the `ix` indexer; use `loc` instead:

In [150]:
df_6 = df_3.loc[range(0, 5), ['state', 'pop', 'unempl', 'year']]
df_6

Unnamed: 0,state,pop,unempl,year
0,VA,5.0,,2012
1,VA,5.1,,2013
2,VA,5.2,6.0,2014
3,MD,4.0,6.0,2014
4,MD,4.1,6.1,2015


## Dropping Entries

Drop rows from a Series or DataFrame:

In [151]:
df_7 = df_6.drop([0])
df_7

Unnamed: 0,state,pop,unempl,year
1,VA,5.1,,2013
2,VA,5.2,6.0,2014
3,MD,4.0,6.0,2014
4,MD,4.1,6.1,2015


Drop columns from a DataFrame:

In [152]:
df_7 = df_7.drop('unempl', axis=1)
df_7

Unnamed: 0,state,pop,year
1,VA,5.1,2013
2,VA,5.2,2014
3,MD,4.0,2014
4,MD,4.1,2015


## Indexing, Selecting, Filtering

Series indexing is similar to NumPy array indexing with the added bonus of being able to use the Series' index values.

In [153]:
ser_2

a    1
b    1
c    2
d   -3
e   -5
dtype: int64

Select a value from a Series:

In [154]:
ser_2.iloc[0] == ser_2['a']

True

Select a slice from a Series:

In [155]:
ser_2[1:4]

b    1
c    2
d   -3
dtype: int64

Select specific values from a Series:

In [156]:
ser_2[['b', 'c', 'd']]

b    1
c    2
d   -3
dtype: int64

Select from a Series based on a filter:

In [157]:
ser_2[ser_2 > 0]

a    1
b    1
c    2
dtype: int64

Select a slice from a Series with labels (note the end point is inclusive):

In [158]:
ser_2['a':'b']

a    1
b    1
dtype: int64

Assign to a Series slice (note the end point is inclusive):

In [159]:
ser_2['a':'b'] = 0
ser_2

a    0
b    0
c    2
d   -3
e   -5
dtype: int64
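The inclusive end point of label slices is easy to confuse with ordinary positional slicing. A short sketch with a hypothetical Series `ser_demo` puts the two side by side.

```python
import pandas as pd

# Hypothetical Series mirroring the shape of ser_2.
ser_demo = pd.Series([1, 1, 2, -3, -5], index=['a', 'b', 'c', 'd', 'e'])

by_label = ser_demo.loc['a':'c']     # end label 'c' is included
by_position = ser_demo.iloc[0:2]     # end position 2 is excluded

print(list(by_label.index), list(by_position.index))
```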

Pandas supports indexing into a DataFrame.

In [160]:
df_6

Unnamed: 0,state,pop,unempl,year
0,VA,5.0,,2012
1,VA,5.1,,2013
2,VA,5.2,6.0,2014
3,MD,4.0,6.0,2014
4,MD,4.1,6.1,2015


Select specified columns from a DataFrame:

In [161]:
df_6[['pop', 'unempl']]

Unnamed: 0,pop,unempl
0,5.0,
1,5.1,
2,5.2,6.0
3,4.0,6.0
4,4.1,6.1


Select a slice from a DataFrame:

In [162]:
df_6[:2]

Unnamed: 0,state,pop,unempl,year
0,VA,5.0,,2012
1,VA,5.1,,2013


Select from a DataFrame based on a filter:

In [163]:
df_6[df_6['pop'] > 5]

Unnamed: 0,state,pop,unempl,year
1,VA,5.1,,2013
2,VA,5.2,6.0,2014


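Perform a scalar comparison on the numeric columns of a DataFrame (comparing the string column `state` with a number would raise a TypeError, and comparisons against NaN are False):

In [164]:
df_6[['pop', 'unempl', 'year']] > 5

Unnamed: 0,pop,unempl,year
0,False,False,True
1,True,False,True
2,True,True,True
3,False,True,True
4,False,True,True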

Use a scalar comparison on a column as a row filter, retaining the rows that pass:

In [165]:
df_6[df_6['pop'] > 5.0]

Unnamed: 0,state,pop,unempl,year
1,VA,5.1,,2013
2,VA,5.2,6.0,2014


Select a slice of rows from a DataFrame (note the end point is inclusive):

In [166]:
df_6.loc[2:3]

Unnamed: 0,state,pop,unempl,year
2,VA,5.2,6.0,2014
3,MD,4.0,6.0,2014


Select a slice of rows from a specific column of a DataFrame:

In [167]:
df_6.loc[0:2, 'pop']

0    5.0
1    5.1
2    5.2
Name: pop, dtype: float64

Select rows based on a comparison applied to a specific column:

In [168]:
df_6.loc[df_6.unempl > 5.0]

Unnamed: 0,state,pop,unempl,year
2,VA,5.2,6.0,2014
3,MD,4.0,6.0,2014
4,MD,4.1,6.1,2015


## Arithmetic and Data Alignment

Adding Series objects results in the union of index pairs if the pairs are not the same, resulting in NaN for indices that do not overlap:

In [169]:
np.random.seed(0)
ser_6 = Series(np.random.randn(5),
               index=['a', 'b', 'c', 'd', 'e'])
ser_6

a    1.764052
b    0.400157
c    0.978738
d    2.240893
e    1.867558
dtype: float64

In [170]:
np.random.seed(1)
ser_7 = Series(np.random.randn(5),
               index=['a', 'c', 'e', 'f', 'g'])
ser_7

a    1.624345
c   -0.611756
e   -0.528172
f   -1.072969
g    0.865408
dtype: float64

In [171]:
ser_6 + ser_7

a    3.388398
b         NaN
c    0.366982
d         NaN
e    1.339386
f         NaN
g         NaN
dtype: float64

Set a fill value instead of NaN for indices that do not overlap:

In [172]:
ser_6.add(ser_7, fill_value=0)

a    3.388398
b    0.400157
c    0.366982
d    2.240893
e    1.339386
f   -1.072969
g    0.865408
dtype: float64

Adding DataFrame objects results in the union of index pairs for rows and columns if the pairs are not the same, resulting in NaN for indices that do not overlap:

In [None]:
np.random.seed(0)
df_8 = DataFrame(np.random.rand(9).reshape((3, 3)),
                 columns=['a', 'b', 'c'])
df_8

In [173]:
np.random.seed(2)
df=DataFrame((np.random.rand(9).reshape(3, 3)), 
             columns=['a', 'b', 'c'])
df


Unnamed: 0,a,b,c
0,0.435995,0.025926,0.549662
1,0.435322,0.420368,0.330335
2,0.204649,0.619271,0.299655


In [174]:
np.random.seed(3)
df_1 = DataFrame(np.random.rand(9).reshape(3,3),
         columns=['a','b','c'])
df_1

Unnamed: 0,a,b,c
0,0.550798,0.708148,0.290905
1,0.510828,0.892947,0.896293
2,0.125585,0.207243,0.051467


In [175]:
np.random.seed(0)
data=np.random.randn(5)
data

array([1.76405235, 0.40015721, 0.97873798, 2.2408932 , 1.86755799])

In [176]:
np.random.seed(0)
data = DataFrame(np.random.randn(9),
                columns=['a'])
data

Unnamed: 0,a
0,1.764052
1,0.400157
2,0.978738
3,2.240893
4,1.867558
5,-0.977278
6,0.950088
7,-0.151357
8,-0.103219


In [177]:
np.random.seed(1)
df_9 = DataFrame(np.random.rand(9).reshape((3, 3)),
                 columns=['b', 'c', 'd'])
df_9

Unnamed: 0,b,c,d
0,0.417022,0.720324,0.000114
1,0.302333,0.146756,0.092339
2,0.18626,0.345561,0.396767


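Replace `df_8` with a copy of `df_9`, so that the two frames used in the following examples share exactly the same row and column labels:

In [182]:
df_8 = df_9.copy()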

Set a fill value instead of NaN for indices that do not overlap (here every label overlaps, so the result is simply the element-wise sum):

In [183]:
df_10 = df_8.add(df_9, fill_value=0)
df_10

Unnamed: 0,b,c,d
0,0.834044,1.440649,0.000229
1,0.604665,0.293512,0.184677
2,0.37252,0.691121,0.793535
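Because the two frames above share all their labels, the effect of `fill_value` is not visible there. A minimal sketch with two hypothetical frames, `left` and `right`, that overlap only in column `b` shows both behaviours.

```python
import pandas as pd

# Hypothetical frames that share only column 'b'.
left = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
right = pd.DataFrame({'b': [10.0, 20.0], 'c': [30.0, 40.0]})

plain = left + right                     # 'a' and 'c' exist on one side only: NaN
filled = left.add(right, fill_value=0)   # the missing side is treated as 0

print(plain)
print(filled)
```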


Like NumPy, pandas supports arithmetic operations between DataFrames and Series.

Match the index of the Series on the DataFrame's columns, broadcasting down the rows:

In [184]:
ser_8 = df_10.loc[0]
df_11 = df_10 - ser_8
df_11

Unnamed: 0,b,c,d
0,0.0,0.0,0.0
1,-0.229379,-1.147137,0.184448
2,-0.461524,-0.749528,0.793306


Match the index of the Series on the DataFrame's columns, broadcasting down the rows and union the indices that do not match:

In [185]:
ser_9 = Series(range(3), index=['a', 'd', 'e'])
ser_9

a    0
d    1
e    2
dtype: int64

In [186]:
df_11 - ser_9

Unnamed: 0,a,b,c,d,e
0,,,,-1.0,
1,,,,-0.815552,
2,,,,-0.206694,


Broadcast over the columns and match the rows (axis=0) by using an arithmetic method:

In [None]:
df_10

In [None]:
ser_10 = Series([100, 200, 300])
ser_10

In [None]:
df_10.sub(ser_10, axis=0)
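As a self-contained sketch of row-wise broadcasting, the hypothetical frame `df_demo` and Series `offsets` below use round numbers so the result is easy to check by hand: with `axis=0` the Series index is matched against the row labels, and each row has its own offset subtracted from every column.

```python
import pandas as pd

# Hypothetical data with easy-to-follow values.
df_demo = pd.DataFrame({'b': [1.0, 2.0, 3.0], 'c': [10.0, 20.0, 30.0]})
offsets = pd.Series([100, 200, 300])   # index 0, 1, 2 matches the row labels

# Row i of the result is row i of df_demo minus offsets[i].
result = df_demo.sub(offsets, axis=0)
print(result)
```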


## Function Application and Mapping

NumPy ufuncs (element-wise array methods) operate on pandas objects:

In [179]:
df_11 = np.abs(df_10)
df_11

Unnamed: 0,b,c,d
0,0.834044,1.440649,0.000229
1,0.604665,0.293512,0.184677
2,0.37252,0.691121,0.793535

Apply a function on 1D arrays to each column:

In [None]:
func_1 = lambda x: x.max() - x.min()
df_11.apply(func_1)

Apply a function on 1D arrays to each row:

In [None]:
df_11.apply(func_1, axis=1)


Apply a function and return a DataFrame:

In [None]:
func_2 = lambda x: Series([x.min(), x.max()], index=['min', 'max'])
df_11.apply(func_2)

Apply an element-wise Python function to a DataFrame (`applymap` is deprecated since pandas 2.1 in favor of the equivalent `DataFrame.map`):

In [None]:
func_3 = lambda x: '%.2f' %x
df_11.applymap(func_3)

Apply an element-wise Python function to a Series:

In [None]:
df_11['b'].map(func_3)
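Since the cells above carry no output, here is a small self-contained sketch of column-wise, row-wise and element-wise application. The frame `df_demo` and the helper `value_range` are hypothetical stand-ins for `df_11` and `func_1`.

```python
import pandas as pd

# Hypothetical frame with values chosen for easy mental arithmetic.
df_demo = pd.DataFrame({'b': [1.0, 4.0, 2.0], 'c': [10.0, 30.0, 20.0]})

def value_range(x):
    """Spread of a 1D array: maximum minus minimum."""
    return x.max() - x.min()

per_column = df_demo.apply(value_range)            # one result per column
per_row = df_demo.apply(value_range, axis=1)       # one result per row
formatted = df_demo['b'].map(lambda v: '%.2f' % v)  # element-wise on a Series

print(per_column.to_dict(), per_row.tolist(), formatted.tolist())
```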

## Sorting and Ranking

In [None]:
ser_4

Sort a Series by its index:

In [None]:
ser_4.sort_index()

Sort a Series by its values:

In [None]:
ser_4.sort_values()

In [None]:
df_12 = DataFrame(np.arange(12).reshape((3, 4)),
                  index=['three', 'one', 'two'],
                  columns=['c', 'a', 'b', 'd'])
df_12


Sort a DataFrame by its index:

In [None]:
df_12.sort_index()

Sort a DataFrame by columns in descending order:

In [None]:
df_12.sort_index(axis=1, ascending=False)

Sort a DataFrame's values by column:

In [None]:
df_12.sort_values(by=['d', 'c'])

Ranking is similar to numpy.argsort except that ties are broken by assigning each group the mean rank:

In [None]:
ser_11 = Series([7, -5, 7, 4, 2, 0, 4, 7])
ser_11 = ser_11.sort_values()
ser_11

In [None]:
ser_11.rank()

Rank a Series, breaking ties by the order in which the values appear in the data:

In [None]:
ser_11.rank(method='first')

Rank a Series in descending order, using the maximum rank for the group:

In [None]:
ser_11.rank(ascending=False, method='max')
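The three tie-breaking rules used above can be compared on a short hypothetical Series, `ser_demo`, in which the value 7 appears twice.

```python
import pandas as pd

# Hypothetical Series with one tie (the two 7s).
ser_demo = pd.Series([7, -5, 7, 4])

average = ser_demo.rank()                 # ties share the mean of their ranks
first = ser_demo.rank(method='first')     # ties are ranked in order of appearance
desc_max = ser_demo.rank(ascending=False, method='max')  # largest value ranks first, ties take the higher rank

print(average.tolist())
print(first.tolist())
print(desc_max.tolist())
```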

DataFrames can rank over rows or columns.

In [None]:
df_13 = DataFrame({'foo' : [7, -5, 7, 4, 2, 0, 4, 7],
                   'bar' : [-5, 4, 2, 0, 4, 7, 7, 8],
                   'baz' : [-1, 2, 3, 0, 5, 9, 9, 5]})
df_13

Rank the values within each column (axis=0, the default):

In [None]:
df_13.rank()


Rank the values within each row (axis=1):

In [None]:
df_13.rank(axis=1)

## Axis Indices with Duplicate Values

Labels do not have to be unique in Pandas:

In [None]:
ser_12 = Series(range(5), index=['foo', 'foo', 'bar', 'bar', 'baz'])
ser_12


In [None]:
ser_12.index.is_unique

Select Series elements:

In [None]:
ser_12['foo']

Select DataFrame elements:

In [None]:
df_14 = DataFrame(np.random.randn(5, 4),
                  index=['foo', 'foo', 'bar', 'bar', 'baz'])
df_14

In [None]:
df_14.loc['bar']
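One consequence of duplicate labels is that the type of a selection depends on the label. A brief sketch with a hypothetical Series `ser_demo`: a duplicated label returns a Series, a unique label returns a scalar.

```python
import pandas as pd

# Hypothetical Series with repeated labels.
ser_demo = pd.Series(range(5), index=['foo', 'foo', 'bar', 'bar', 'baz'])

dup = ser_demo['foo']      # duplicated label: a Series with both entries
single = ser_demo['baz']   # unique label: a scalar

print(ser_demo.index.is_unique, dup.tolist(), single)
```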

## Summarizing and Computing Descriptive Statistics

Unlike NumPy arrays, Pandas descriptive statistics automatically exclude missing data.  NaN values are excluded unless the entire row or column is NA.

In [None]:
df_6

In [None]:
df_6.sum()

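Sum across the columns of each row (restricted to numeric columns, because adding the string column `state` to floats can raise a TypeError in recent pandas versions):

In [None]:
df_6.sum(axis=1, numeric_only=True)

Propagate NaNs instead of skipping them:

In [None]:
df_6.sum(axis=1, numeric_only=True, skipna=False)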
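The effect of `skipna` can be checked on a small all-numeric frame. The frame `df_demo` below is hypothetical and contains a single NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame with one missing value.
df_demo = pd.DataFrame({'pop': [5.0, 5.5], 'unempl': [np.nan, 6.0]})

col_sums = df_demo.sum()                             # NaN is skipped
row_sums = df_demo.sum(axis=1)                       # NaN is skipped in row 0
row_sums_strict = df_demo.sum(axis=1, skipna=False)  # NaN propagates to row 0

print(col_sums.to_dict())
print(row_sums.tolist())
print(row_sums_strict.tolist())
```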

## Cleaning Data (Under Construction)
* Replace
* Drop
* Concatenate

In [None]:
from pandas import Series, DataFrame
import pandas as pd

Set up a DataFrame:

In [None]:
data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'population' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = DataFrame(data_1)
df_1

### Replace

Replace all occurrences of a string with another string, in place (no copy):

In [None]:
df_1.replace('VA', 'VIRGINIA', inplace=True)
df_1

In a specified column, replace all occurrences of a string with another string, in place (no copy):

In [None]:
df_1.replace({'state' : { 'MD' : 'MARYLAND' }}, inplace=True)
df_1
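The difference between a frame-wide and a column-scoped replacement is easier to see when the same string occurs in two columns. The sketch below uses a hypothetical frame `df_demo` and the non-in-place form of `replace`, which returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical frame where 'VA' appears in two columns.
df_demo = pd.DataFrame({'state': ['VA', 'MD'], 'code': ['VA', 'XX']})

everywhere = df_demo.replace('VA', 'VIRGINIA')               # every column
one_column = df_demo.replace({'state': {'VA': 'VIRGINIA'}})  # 'state' only

print(everywhere)
print(one_column)
```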

### Drop

Drop the 'population' column and return a copy of the DataFrame:

In [None]:
df_2 = df_1.drop('population', axis=1)
df_2

### Concatenate

Concatenate two DataFrames:

In [None]:
data_2 = {'state' : ['NY', 'NY', 'NY', 'FL', 'FL'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'population' : [6.0, 6.1, 6.2, 3.0, 3.1]}
df_3 = DataFrame(data_2)
df_3

In [None]:
df_4 = pd.concat([df_1, df_3])
df_4
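Concatenation keeps each frame's original row labels, so the result can contain duplicate index values. A minimal sketch with two hypothetical frames, `top` and `bottom`, shows this and the `ignore_index=True` option that renumbers the rows.

```python
import pandas as pd

# Hypothetical frames, each with default labels 0 and 1.
top = pd.DataFrame({'state': ['VA', 'MD'], 'year': [2014, 2015]})
bottom = pd.DataFrame({'state': ['NY', 'FL'], 'year': [2014, 2015]})

stacked = pd.concat([top, bottom])                         # labels 0, 1, 0, 1
renumbered = pd.concat([top, bottom], ignore_index=True)   # labels 0, 1, 2, 3

print(list(stacked.index), list(renumbered.index))
```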

## Input and Output (Under Construction)
* Reading
* Writing

In [None]:
from pandas import Series, DataFrame
import pandas as pd

### Reading

Read data from a CSV file into a DataFrame (use sep='\t' for TSV):

In [None]:
df_1 = pd.read_csv("data/ozone_copy.csv")

Get a summary of the DataFrame:

In [None]:
df_1.describe()

List the first five rows of the DataFrame:

In [None]:
df_1.head()

### Writing

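Write the DataFrame to a CSV file, encoded in UTF-8 and omitting the index and header labels (note that this is the same path that was read above, so that file is overwritten and no longer has a header row):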

In [None]:
df_1.to_csv('data/ozone_copy.csv', 
            encoding='utf-8', 
            index=False, 
            header=False)
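A file written without a header needs column names supplied when it is read back. The round trip can be sketched in memory with `io.StringIO`, so no file on disk is assumed; the frame `df_demo` and its column names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical frame; the buffer stands in for a file on disk.
df_demo = pd.DataFrame({'state': ['VA', 'MD'], 'pop': [5.0, 4.1]})

buffer = io.StringIO()
df_demo.to_csv(buffer, index=False, header=False)   # write the values only
csv_text = buffer.getvalue()

# With no header row in the data, pass header=None and the names explicitly.
restored = pd.read_csv(io.StringIO(csv_text), header=None, names=['state', 'pop'])
print(restored)
```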

View the data directory:

In [None]:
import os
print(os.listdir('./data'))