In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Series

### Chapter Goals

    Learn about the pandas Series object and its basic utilities
    Write code to create several Series objects


A. 1-D data

Similar to NumPy, pandas frequently deals with 1-D and 2-D data. However, we use two separate objects to deal with 1-D and 2-D data in pandas. For 1-D data, we use the pandas.Series object, which we'll refer to simply as a Series.

A Series is created through the pd.Series constructor, which takes in no required arguments but does have a variety of keyword arguments.

The first keyword argument is data, which specifies the elements of the Series. If data is not set, pd.Series returns an empty Series. Since the data keyword argument is almost always used, we treat it like a regular first argument (i.e. skip the data= prefix).

Similar to the np.array constructor, pd.Series also takes in the dtype keyword argument for manual casting.

In [23]:
#create pandas Series objects using pd.Series
series = pd.Series()
print(series,"\n")
series = pd.Series(5)
print(series, "\n")
series = pd.Series([1,2,3])
print(series, "\n")
series = pd.Series([1, 2, 3.3]) #upcasting means all will be floats now
print('{}\n'.format(series))
arr = np.array([1,2,3])
series = pd.Series(arr, dtype = np.float32) #passing a numpy array
print(series, '\n')
series = pd.Series([1,2], [3,4]) #the second positional argument is the index
print(series, '\n')


Series([], dtype: float64) 

0    5
dtype: int64 

0    1
1    2
2    3
dtype: int64 

0    1.0
1    2.0
2    3.3
dtype: float64

0    1.0
1    2.0
2    3.0
dtype: float32 

3    1
4    2
dtype: int64 



  


B. Index

In the previous examples, you may have noticed the zero-indexed integers to the left of the elements in each Series. These integers are collectively referred to as the index of a Series, and each individual index element is referred to as a label.

The default index is integers from 0 to n - 1, where n is the number of elements in the Series. However, we can specify a custom index via the index keyword argument of pd.Series.

The code below shows how to use the index keyword argument with pd.Series.

In [28]:
series = pd.Series([1,2,3], index = ['a', 'b', 'c'])
print(series, '\n')
series = pd.Series([2,3,4.5], index = [1, 2, '3'])
print(series, '\n')

a    1
b    2
c    3
dtype: int64 

1    2.0
2    3.0
3    4.5
dtype: float64 
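The difference between the default index and a custom one can also be inspected directly through the index attribute. The snippet below is a small sketch; the variable names (default_series, custom_series) are made up for illustration and do not appear elsewhere in this chapter.

```python
import pandas as pd

# Default index: integer labels 0 .. n-1
default_series = pd.Series([10, 20, 30])
default_labels = default_series.index.tolist()

# Custom index: labels supplied through the index keyword argument
custom_series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
custom_labels = custom_series.index.tolist()

print(default_labels)        # [0, 1, 2]
print(custom_labels)         # ['a', 'b', 'c']
print(custom_series['b'])    # 20, looked up by its label
```

Each label identifies one element, so a label can be used as a lookup key in the same way an integer position would be used with the default index.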



C. Dictionary input

Another way to set the index of a Series is by using a Python dictionary for the data argument. The keys of the dictionary represent the index of the Series, while each individual key is the label for its corresponding value.

The code below shows how to use pd.Series with a Python dictionary as the first argument. In our example, we set 'a', 'b', and 'c' as the Series index, with corresponding values 1, 2, and 3.

In [32]:
series = pd.Series({"a": 1, "b": 2, "c": 3}) #Keys will be index or row labels and values are series values
print(series)

a    1
b    2
c    3
dtype: int64


### Time to Code!

In [35]:
'''The first Series we create will contain basic floating point numbers. 
The list we use to initialize the Series is [1, 3, 5.2]
'''
s1 = pd.Series([1, 3, 5.2])
print(s1)

0    1.0
1    3.0
2    5.2
dtype: float64


The second Series we create comes from performing element-wise multiplication on s1 using a separate list of floating point numbers.

Set s2 equal to s1 multiplied by [0.1, 0.2, 0.3]

In [36]:
s2 = s1* [0.1, 0.2, 0.3]
print(s2)

0    0.10
1    0.60
2    1.56
dtype: float64


We'll create another Series, this time from whole-number values plus one missing value. The list we use to initialize this Series is [1, 3, 8, np.nan]. Because np.nan is a float, the Series is upcast to float64. This Series will also have row labels, which will be ['a', 'b', 'c', 'd'].

Set s3 equal to pd.Series with the specified list of values as the first argument and the list of labels as the index keyword argument.

In [38]:
s3 = pd.Series([1, 3, 8, np.nan], index = ['a', 'b', 'c', 'd'])
print(s3)

a    1.0
b    3.0
c    8.0
d    NaN
dtype: float64
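The float64 dtype in the output comes from the np.nan entry, which is itself a float. The sketch below rebuilds s3 and checks this; the isna and dropna calls are standard pandas methods used here only to illustrate the point, and are not part of the exercise.

```python
import numpy as np
import pandas as pd

s3 = pd.Series([1, 3, 8, np.nan], index=['a', 'b', 'c', 'd'])

print(s3.dtype)                  # float64, because np.nan is a float
missing = s3.isna()              # True where a value is missing
print(missing.tolist())          # [False, False, False, True]

# Once the missing value is dropped, the rest can be cast back to integers
whole_numbers = s3.dropna().astype(int)
print(whole_numbers.tolist())    # [1, 3, 8]
```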


The final Series we create will be initialized from a Python dictionary. The dictionary will have key-value pairs 'a':0, 'b':1, and 'c':2.

Set s4 equal to pd.Series with a dictionary of the specified key-value pairs as the only argument.

In [39]:
s4 = pd.Series({'a': 0, 'b': 1, 'c':2})
print(s4)

a    0
b    1
c    2
dtype: int64


# DataFrame

Learn about the pandas DataFrame object for 2-D data.
Chapter Goals:

    Learn about the pandas DataFrame object and its basic utilities
    Write code to create and manipulate a pandas DataFrame


A. 2-D data

One of the main purposes of pandas is to deal with tabular data, i.e. data that comes from tables or spreadsheets. Since tabular data contains rows and columns, it is 2-D. For working with 2-D data, we use the pandas.DataFrame object, which we'll refer to simply as a DataFrame.

A DataFrame is created through the pd.DataFrame constructor, which takes in essentially the same arguments as pd.Series. However, while a Series could be constructed from a scalar (representing a single value Series), a DataFrame cannot.

Furthermore, pd.DataFrame takes in an additional columns keyword argument, which represents the labels for the columns (similar to how index represents the row labels).

The code below shows how to use the pd.DataFrame constructor.

In [56]:
df = pd.DataFrame()
print(df, '\n')
df = pd.DataFrame([5,6])
print(df, '\n')
df = pd.DataFrame([5,6], [7,8]) # 7 and 8 become the row labels (index)
print(df, '\n')
df = pd.DataFrame([[5,6]])
print(df, '\n')
df = pd.DataFrame([[1,2], [5, 6]]) # 1,2 will be one row and 5,6 will be another one
print(df, '\n')
df = pd.DataFrame([[5,6], [1,3]], index = ['r1', 'r2'], columns= ['c1', 'c2'])
print(df, '\n')
df = pd.DataFrame({'c1':[1,2], 'c2': [3,4]}, index = ['r1', 'r2'])
print(df, '\n')


Empty DataFrame
Columns: []
Index: [] 

   0
0  5
1  6 

   0
7  5
8  6 

   0  1
0  5  6 

   0  1
0  1  2
1  5  6 

    c1  c2
r1   5   6
r2   1   3 

    c1  c2
r1   1   3
r2   2   4 
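The scalar restriction mentioned at the start of this section can be checked with a short experiment. This is a sketch under the assumption that the constructor raises a ValueError for a bare scalar, which is the behavior in the pandas versions I am aware of; the names scalar_error and one_cell are illustrative.

```python
import pandas as pd

scalar_series = pd.Series(5)      # works: a one-element Series

try:
    pd.DataFrame(5)               # a bare scalar is rejected
    scalar_error = None
except ValueError as err:
    scalar_error = err

print(type(scalar_error).__name__)

# Wrapping the scalar in a nested list gives a 1x1 DataFrame instead
one_cell = pd.DataFrame([[5]])
print(one_cell.shape)             # (1, 1)
```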



### B. Upcasting

When we initialize a DataFrame of mixed types, upcasting occurs on a per-column basis. The dtypes property returns the types in each column as a Series of types.

The code below shows how upcasting works in DataFrames. You'll notice that upcasting only occurs in the first column for the DataFrame below, because the second column's values are both integers.

In [58]:
upcast = pd.DataFrame([[5,6], [1.2, 3]]) # First column will upcast to floats
print(upcast, '\n')
print(upcast.dtypes)

     0  1
0  5.0  6
1  1.2  3 

0    float64
1      int64
dtype: object


C. Appending rows

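We can append additional rows to a given DataFrame through the append method. The required argument for the method is either a Series or DataFrame, representing the row(s) we append. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the examples in this section run only on older versions; on current versions, pd.concat (covered in the next chapter) does the same job.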

Note that the append function returns the modified DataFrame but doesn't actually change the original. Furthermore, when we append a Series to the DataFrame, we either need to specify the name for the series or use the ignore_index keyword argument. Setting ignore_index=True will change the row labels to integer indexes.

The code below shows example usages of the append function.

In [73]:
df = pd.DataFrame([[5,6], [1.2, 3]])
print(df, '\n')
ser = pd.Series([0,0], name = 'r3')
print(ser, '\n')
df_app = df.append(ser) # Appending a row from a pd series object
print(df_app,'\n')
df_app = df.append(ser, ignore_index = True) # Reset the row labels to the default integer index
print(df_app,'\n')
df2 = pd.DataFrame([[3,0], [9,9]])
print(df2,'\n')
df_app = df.append(df2, ignore_index = True)
print(df_app,'\n')


     0  1
0  5.0  6
1  1.2  3 

0    0
1    0
Name: r3, dtype: int64 

      0  1
0   5.0  6
1   1.2  3
r3  0.0  0 

     0  1
0  5.0  6
1  1.2  3
2  0.0  0 

   0  1
0  3  0
1  9  9 

     0  1
0  5.0  6
1  1.2  3
2  3.0  0
3  9.0  9 
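On pandas versions where DataFrame.append is no longer available, the same results can be produced with pd.concat. The sketch below mirrors the three append calls above; converting the Series to a one-row DataFrame with to_frame().T is one possible approach, not the only one, and the result names (with_label, renumbered, stacked) are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[5, 6], [1.2, 3]])
ser = pd.Series([0, 0], name='r3')

# to_frame().T turns the Series into a one-row DataFrame labeled by its name
row = ser.to_frame().T
with_label = pd.concat([df, row])                      # keeps the 'r3' label
renumbered = pd.concat([df, row], ignore_index=True)   # relabels rows 0, 1, 2

df2 = pd.DataFrame([[3, 0], [9, 9]])
stacked = pd.concat([df, df2], ignore_index=True)

print(with_label.index.tolist())    # [0, 1, 'r3']
print(renumbered.index.tolist())    # [0, 1, 2]
print(stacked.shape)                # (4, 2)
print(df.shape)                     # (2, 2): the original is unchanged
```

As with append, pd.concat returns a new DataFrame and leaves its inputs untouched.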



D. Dropping data

We can drop rows or columns from a given DataFrame through the drop function. There is no required argument, but the keyword arguments of the function give us two ways to drop rows/columns from a DataFrame.

The first way is using the labels keyword argument to specify the labels of the rows/columns we want to drop. We use this alongside the axis keyword argument (which has default value of 0) to drop from the rows or columns axis.

The second method is to directly use the index or columns keyword arguments to specify the labels of the rows or columns directly, without needing to use axis.

The code below shows examples of how to use the drop function. Note that axis=0 refers to rows and axis=1 refers to columns.

In [92]:
df = pd.DataFrame({'c1': [1,2], 'c2': [3,4], 'c3': [5,6]}, index = ['r1', 'r2'])
print(df, '\n')
#Drop row r1
df_drop = df.drop(labels = 'r1')
print(df_drop, '\n')
df_drop = df.drop(labels=['c1', 'c3'], axis = 1) # means drop columns
print(df_drop, '\n')
df_drop = df.drop(index = 'r2')
print(df_drop, '\n')
df_drop = df.drop(columns = 'c2')
print(df_drop)
df_drop = df.drop(index = 'r2', columns = 'c2')
print(df_drop)
df_drop = df.drop('c1', axis = 1) #means remove column c1
print(df_drop)
df_drop = df.drop(columns = ['c1', 'c2']) # Simply meaning drop columns labelled c1 and c2
print(df_drop)
df_drop = df.drop(labels = ['c1', 'c3'], axis = 1) # With labels, also pass axis; otherwise axis=0 (rows) is assumed
print(df_drop)

    c1  c2  c3
r1   1   3   5
r2   2   4   6 

    c1  c2  c3
r2   2   4   6 

    c2
r1   3
r2   4 

    c1  c2  c3
r1   1   3   5 

    c1  c3
r1   1   5
r2   2   6
    c1  c3
r1   1   5
    c2  c3
r1   3   5
r2   4   6
    c3
r1   5
r2   6
    c2
r1   3
r2   4
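The two ways of calling drop described above give the same result. The following sketch compares them with DataFrame.equals and also checks that the original DataFrame is left untouched; the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2], 'c2': [3, 4], 'c3': [5, 6]}, index=['r1', 'r2'])

# Columns: labels + axis=1 versus the columns keyword
by_labels = df.drop(labels=['c1', 'c3'], axis=1)
by_columns = df.drop(columns=['c1', 'c3'])
print(by_labels.equals(by_columns))          # True

# Rows: labels with the default axis=0 versus the index keyword
rows_by_labels = df.drop(labels='r1')
rows_by_index = df.drop(index='r1')
print(rows_by_labels.equals(rows_by_index))  # True

print(df.shape)   # (2, 3): drop returned new objects and left df unchanged
```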


### Time to Code!

The coding exercise for this chapter involves creating various pandas DataFrame objects.

We'll first create a DataFrame from a Python dictionary. The dictionary will have key-value pairs 'c1':[0, 1, 2, 3] and 'c2':[5, 6, 7, 8], in that order.

The index for the DataFrame will come from the list of row labels ['r1', 'r2', 'r3', 'r4'].

Set df equal to pd.DataFrame with the specified dictionary as the first argument and the list of row labels as the index keyword argument.

In [94]:
df = pd.DataFrame({'c1': [0,1,2,3], 'c2': [5,6,7,8]}, index = ['r1', 'r2', 'r3', 'r4'])
print(df)

    c1  c2
r1   0   5
r2   1   6
r3   2   7
r4   3   8


We'll create another DataFrame, this one representing a single row. Rather than a dictionary for the first argument, we use a list of lists, and manually set the column labels to ['c1', 'c2'].

Since there is only one row, the row labels will be ['r5'].

Set row_df equal to pd.DataFrame with [[9, 9]] as the first argument, and the specified column and row labels for the columns and index keyword arguments.

In [95]:
row_df = pd.DataFrame([[9,9]], index = ['r5'], columns= ['c1', 'c2'])
print(row_df)

    c1  c2
r5   9   9


After creating row_df, we append it to the end of df and drop row 'r2'.

Set df_app equal to df.append with row_df as the only argument.

Then set df_drop equal to df_app.drop with 'r2' as the labels keyword argument.

In [98]:
print(df, '\n')
print(row_df, '\n')

    c1  c2
r1   0   5
r2   1   6
r3   2   7
r4   3   8 

    c1  c2
r5   9   9 



In [99]:
df_app = df.append(row_df) # pandas < 2.0 only; on newer versions use pd.concat([df, row_df])
df_app

    c1  c2
r1   0   5
r2   1   6
r3   2   7
r4   3   8
r5   9   9


In [101]:
df_drop = df_app.drop(labels = 'r2')
df_drop

    c1  c2
r1   0   5
r3   2   7
r4   3   8
r5   9   9


# Combining

Combine multiple DataFrames through concatenation and merging.
Chapter Goals:

    Understand the methods used to combine DataFrame objects
    Write code for combining DataFrames

In the previous chapter, we discussed the append function for concatenating DataFrame rows. To concatenate multiple DataFrames along either rows or columns, we use the pd.concat function.

The code below shows example usages of pd.concat.

In [102]:
df1 = pd.DataFrame({'c1': [1,2], 'c2': [3,4]}, index = ['r1', 'r2'])
df1

    c1  c2
r1   1   3
r2   2   4


In [103]:
df2 = pd.DataFrame({'c1': [5,6], 'c2': [7,8]}, index = ['r1', 'r2'])
df2

    c1  c2
r1   5   7
r2   6   8


In [104]:
df3 = pd.DataFrame({'c1': [5,6], 'c2':[7,8]})
df3

   c1  c2
0   5   7
1   6   8


In [106]:
concat = pd.concat([df1, df2], axis = 1)
concat

    c1  c2  c1  c2
r1   1   3   5   7
r2   2   4   6   8


In [108]:
concat = pd.concat([df2, df1, df3]) # Concatenate along rows (axis=0 is the default)
concat

    c1  c2
r1   5   7
r2   6   8
r1   1   3
r2   2   4
0    5   7
1    6   8


In [110]:
concat = pd.concat([df1, df3], axis = 1)
concat

     c1   c2   c1   c2
r1  1.0  3.0  NaN  NaN
r2  2.0  4.0  NaN  NaN
0   NaN  NaN  5.0  7.0
1   NaN  NaN  6.0  8.0


In the code example, the final call to pd.concat resulted in a DataFrame with many NaN values. This is because the row labels for df1 and df3 did not match, so the result was padded with NaN in locations where values did not exist.
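To see that the NaN padding really comes from the mismatched row labels, we can count the missing values and then repeat the concatenation after giving df3 the same labels as df1. This is a sketch for illustration; df3_relabeled is a hypothetical name, and overwriting the index on a copy is just one way to relabel rows.

```python
import pandas as pd

df1 = pd.DataFrame({'c1': [1, 2], 'c2': [3, 4]}, index=['r1', 'r2'])
df3 = pd.DataFrame({'c1': [5, 6], 'c2': [7, 8]})    # default labels 0 and 1

padded = pd.concat([df1, df3], axis=1)              # every row label is kept
padded_nan_count = int(padded.isna().sum().sum())
print(padded.shape, padded_nan_count)               # (4, 4) 8

# Give df3 the same row labels as df1 so the rows line up
df3_relabeled = df3.copy()
df3_relabeled.index = df1.index
aligned = pd.concat([df1, df3_relabeled], axis=1)
aligned_nan_count = int(aligned.isna().sum().sum())
print(aligned.shape, aligned_nan_count)             # (2, 4) 0
```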


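The pd.concat function takes in a list of pandas objects (normally a list of DataFrames) to concatenate. The function also takes in numerous keyword arguments, with axis being one of the more important ones. The axis argument specifies whether we concatenate the rows (axis=0, the default), or concatenate the columns (axis=1).

While pd.concat stacks DataFrames along an axis, the pd.merge function joins two DataFrames on the values in their shared columns. The code below shows an example usage of pd.merge.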

In [112]:
mlb_df1 = pd.DataFrame({'name': ['john doe', 'al smith', 'sam black', 'john doe'],
                        'pos': ['1B', 'C', 'P', '2B'],
                        'year': [2000, 2004, 2008, 2003]})
mlb_df1

        name pos  year
0   john doe  1B  2000
1   al smith   C  2004
2  sam black   P  2008
3   john doe  2B  2003


In [113]:
mlb_df2 = pd.DataFrame({'name': ['john doe', 'al smith', 'jack lee'],
                        'year': [2000, 2004, 2012],
                        'rbi': [80, 100, 12]})
mlb_df2

       name  year  rbi
0  john doe  2000   80
1  al smith  2004  100
2  jack lee  2012   12


In [114]:
mlb_merged = pd.merge(mlb_df1, mlb_df2)
mlb_merged

       name pos  year  rbi
0  john doe  1B  2000   80
1  al smith   C  2004  100


Without using any keyword arguments, pd.merge joins two DataFrames using all their common column labels. In the code example, the common labels between mlb_df1 and mlb_df2 were name and year.

The rows that contain the exact same values for the common column labels will be merged. Since 'john doe' for year 2000 was in both mlb_df1 and mlb_df2, its row was merged. However, 'john doe' for year 2003 was only in mlb_df1, so its row was not merged.

The pd.merge function takes in many keyword arguments, but often none are needed to properly merge two DataFrames.
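Two of those keyword arguments are on (which columns to match) and how (which rows to keep). The sketch below spells out the defaults that the example above relied on and then contrasts them with an outer merge; the result names are illustrative.

```python
import pandas as pd

mlb_df1 = pd.DataFrame({'name': ['john doe', 'al smith', 'sam black', 'john doe'],
                        'pos': ['1B', 'C', 'P', '2B'],
                        'year': [2000, 2004, 2008, 2003]})
mlb_df2 = pd.DataFrame({'name': ['john doe', 'al smith', 'jack lee'],
                        'year': [2000, 2004, 2012],
                        'rbi': [80, 100, 12]})

# With no keyword arguments, the merge uses all common columns and how='inner'
default_merge = pd.merge(mlb_df1, mlb_df2)
explicit_merge = pd.merge(mlb_df1, mlb_df2, on=['name', 'year'], how='inner')
print(default_merge.equals(explicit_merge))    # True

# how='outer' keeps unmatched rows from both inputs, filling gaps with NaN
outer_merge = pd.merge(mlb_df1, mlb_df2, how='outer')
print(len(default_merge), len(outer_merge))    # 2 5
```

The outer merge has five rows because there are five distinct (name, year) pairs across the two inputs, only two of which appear in both.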

### Time to Code!

The coding exercises for this chapter involve completing small functions that take in two DataFrame objects as input.

The first function, concat_rows, will concatenate the rows of the two DataFrames.

Set row_concat equal to pd.concat with [df1, df2] as the only argument. Then return row_concat.

In [115]:
def concat_rows(df1, df2):
    row_concat = pd.concat([df1, df2])
    return row_concat

In [116]:
concat_rows(df1, df2) # Rows are stacked; columns are matched by label

    c1  c2
r1   1   3
r2   2   4
r1   5   7
r2   6   8


The next function, concat_cols, will concatenate the columns of the two input DataFrames.

Set col_concat equal to pd.concat with [df1, df2] as the required argument. Also set the axis keyword argument to 1.

Then return col_concat.

In [117]:
def concat_cols(df1, df2):
    col_concat = pd.concat([df1, df2], axis = 1)
    return col_concat

In [119]:
concat_cols(df1, df2) # Columns are placed side by side; rows are matched by label

    c1  c2  c1  c2
r1   1   3   5   7
r2   2   4   6   8


The final function, merge_dfs, will merge the two input DataFrames on their common columns.

Set merged_df equal to pd.merge with df1 and df2 as the first and second arguments, respectively.

Then return merged_df.

In [122]:
def merge_dfs(df1, df2):
    merged_df = pd.merge(df1, df2)
    return merged_df

In [124]:
merge_dfs(df1, df2) # Columns c1 and c2 are shared, but no rows have matching values, so the result is empty

Empty DataFrame
Columns: [c1, c2]
Index: []


# Indexing

Understand how DataFrame values can be accessed via indexing.
Chapter Goals:

    Learn how to index a DataFrame to retrieve rows and columns
    Write code for indexing a DataFrame


A. Direct indexing

When indexing into a DataFrame, we can treat the DataFrame as a dictionary of Series objects, where each column represents a Series. Each column label then becomes a key, allowing us to directly retrieve columns using dictionary-like bracket notation.

The code below shows how to directly index into a DataFrame's columns.

In [127]:
df = pd.DataFrame({'c1': [1,2], 'c2': [3,4], 'c3': [5,6]}, index = ['r1', 'r2'])
df

    c1  c2  c3
r1   1   3   5
r2   2   4   6


In [131]:
col1 = df['c1']
print(col1)
type(col1) # This is a series


r1    1
r2    2
Name: c1, dtype: int64


pandas.core.series.Series

In [134]:
col1_df = df[['c1']]
col1_df # A list of column labels (even a single one) returns a DataFrame

    c1
r1   1
r2   2


In [136]:
col23 = df[['c2', 'c3']]
col23

    c2  c3
r1   3   5
r2   4   6


Note that when we use a single column label inside the bracket (as was the case for col1 in the code example), the output is a Series representing the corresponding column. When we use a list of column labels (as was the case for col1_df and col23), the output is a DataFrame that contains the corresponding columns.
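The dictionary analogy from the start of this section extends beyond bracket lookups. The sketch below uses a few dictionary-style operations that pandas DataFrames support (keys, items, and the in operator); it is meant as an illustration of the analogy rather than a recommended access pattern.

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2], 'c2': [3, 4], 'c3': [5, 6]}, index=['r1', 'r2'])

print(list(df.keys()))     # ['c1', 'c2', 'c3']: the keys are the column labels
print('c1' in df)          # True: membership tests check column labels
print('r1' in df)          # False: row labels are not keys

# Iterating over items yields (column label, Series) pairs
for label, column in df.items():
    print(label, type(column).__name__, column.tolist())

single = df['c1']          # one label       -> Series
subset = df[['c1']]        # list of labels  -> DataFrame
print(type(single).__name__, type(subset).__name__)   # Series DataFrame
```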

We can also use direct indexing to retrieve a subset of the rows (as a DataFrame). However, we can only retrieve rows based on slices, rather than specifying particular rows.

The code below shows how to directly index into a DataFrame's rows.

In [139]:
df = pd.DataFrame({'c1': [1,2,3], 'c2': [4,5,6], 'c3': [7,8,9]}, index = ['r1', 'r2', 'r3'])
df

    c1  c2  c3
r1   1   4   7
r2   2   5   8
r3   3   6   9


In [141]:
first_two_rows = df[0:2]
first_two_rows

    c1  c2  c3
r1   1   4   7
r2   2   5   8


In [143]:
last_two_rows = df['r2': 'r3']
last_two_rows

    c1  c2  c3
r2   2   5   8
r3   3   6   9


In [145]:
#Results in KeyError
# df['r1']

You'll notice that when we used integer indexing for the rows, the end index was exclusive (e.g. first_two_rows excluded the row at index 2). However, when we use row labels, the end index is inclusive (e.g. last_two_rows included the row labeled 'r3').

Furthermore, when we tried to retrieve a single row based on its label, we received a KeyError. This is because the DataFrame treated 'r1' as a column label.
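The three behaviors described above (exclusive integer slices, inclusive label slices, and the KeyError for a single row label) can be confirmed in one short script. This is a sketch; the flag name lookup_failed is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3], 'c2': [4, 5, 6], 'c3': [7, 8, 9]},
                  index=['r1', 'r2', 'r3'])

by_position = df[0:2]        # end position 2 is excluded
by_label = df['r1':'r2']     # end label 'r2' is included
print(by_position.index.tolist())    # ['r1', 'r2']
print(by_label.index.tolist())       # ['r1', 'r2']

try:
    df['r1']                 # looked up as a column label, which does not exist
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)         # True
```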

### B. Other indexing

Apart from direct indexing, a DataFrame object also contains the loc and iloc properties for indexing.

We use iloc to access rows based on their integer index. Using iloc we can access a single row as a Series, and specify particular rows to access through a list of integers or a boolean array.

The code below shows how to use iloc to access a DataFrame's rows.

In [147]:
df = pd.DataFrame({'c1': [1,2,3], 'c2': [4,5,6], 'c3': [7,8,9]}, index = ['r1', 'r2', 'r3'])
df

    c1  c2  c3
r1   1   4   7
r2   2   5   8
r3   3   6   9


In [148]:
df.iloc[1] # Row 2 as a series

c1    2
c2    5
c3    8
Name: r2, dtype: int64

In [150]:
df.iloc[[0,2]] # Rows 1 and 3

    c1  c2  c3
r1   1   4   7
r3   3   6   9


In [153]:
bool_list = [False, True, True]
df.iloc[bool_list] #means gather only the second and the third rows

    c1  c2  c3
r2   2   5   8
r3   3   6   9


The loc property provides the same row indexing functionality as iloc, but uses row labels rather than integer indexes. Furthermore, with loc we can perform column indexing along with row indexing, and set new values in a DataFrame for specific rows and columns.

The code below shows example usages of loc.

In [154]:
df = pd.DataFrame({'c1': [1,2,3], 'c2': [4,5,6], 'c3': [7,8,9]}, index = ['r1', 'r2', 'r3'])
df

    c1  c2  c3
r1   1   4   7
r2   2   5   8
r3   3   6   9


In [155]:
df.loc['r2']

c1    2
c2    5
c3    8
Name: r2, dtype: int64

In [156]:
bool_list = [False, True, True]
df.loc[bool_list]

    c1  c2  c3
r2   2   5   8
r3   3   6   9
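In practice, the boolean list is often computed from the data instead of being written by hand. The sketch below builds the same mask from a comparison on column c1; the condition itself (values greater than 1) is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3], 'c2': [4, 5, 6], 'c3': [7, 8, 9]},
                  index=['r1', 'r2', 'r3'])

mask = df['c1'] > 1                  # boolean Series, one entry per row
print(mask.tolist())                 # [False, True, True]

selected = df.loc[mask]
print(selected.index.tolist())       # ['r2', 'r3']

# The hand-written list selects the same rows
same_rows = selected.equals(df.loc[[False, True, True]])
print(same_rows)                     # True
```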


In [157]:
df

    c1  c2  c3
r1   1   4   7
r2   2   5   8
r3   3   6   9


In [158]:
df.loc['r1', 'c2'] #First row, second column

4

In [160]:
df

    c1  c2  c3
r1   1   4   7
r2   2   5   8
r3   3   6   9


In [161]:
df.loc[['r1', 'r3'], 'c2']

r1    4
r3    6
Name: c2, dtype: int64

In [162]:
df

    c1  c2  c3
r1   1   4   7
r2   2   5   8
r3   3   6   9


In [163]:
df.loc[['r1', 'r3'], 'c2'] = 0
df

    c1  c2  c3
r1   1   0   7
r2   2   5   8
r3   3   0   9


You'll notice that the way we access rows and columns together with loc is similar to how we access 2-D NumPy arrays.
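The comparison with NumPy can be made concrete by converting a DataFrame to an array and indexing both side by side. This sketch rebuilds the original df (before any values were overwritten) and uses to_numpy for the conversion; iloc accepts the same row-and-column form as loc, with integer positions in place of labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3], 'c2': [4, 5, 6], 'c3': [7, 8, 9]},
                  index=['r1', 'r2', 'r3'])
arr = df.to_numpy()                    # plain 2-D array with the same values
print(isinstance(arr, np.ndarray))     # True

# Single element: [row, column]
print(arr[0, 1])                       # 4
print(df.iloc[0, 1])                   # 4, by integer position
print(df.loc['r1', 'c2'])              # 4, by label

# Several rows of one column
print(arr[[0, 2], 1].tolist())                 # [4, 6]
print(df.loc[['r1', 'r3'], 'c2'].tolist())     # [4, 6]
```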

Although loc and iloc can also select whole columns when given a full row slice (e.g. df.loc[:, 'c2']), bracket indexing remains the most direct way to retrieve columns of a DataFrame.

In [164]:
df

    c1  c2  c3
r1   1   0   7
r2   2   5   8
r3   3   0   9


### Time to Code!

The coding exercises for this chapter involve directly indexing into a predefined DataFrame, df.

We'll initially use direct indexing to get the first column of df as well as the first two rows.

Set col_1 equal to df directly indexed with 'c1' as the key.

Set row_12 equal to df directly indexed with 0:2 as the key.

In [166]:
col_1 = df['c1']
col_1

r1    1
r2    2
r3    3
Name: c1, dtype: int64

In [168]:
row_12 = df[0:2]
row_12

    c1  c2  c3
r1   1   0   7
r2   2   5   8


Next, we'll use iloc to retrieve the first and third rows of df.

Set row_13 equal to df.iloc indexed with [0, 2] as the key.

In [170]:
row_13 = df.iloc[[0,2]] # A list of positions returns a DataFrame
row_13

    c1  c2  c3
r1   1   0   7
r3   3   0   9


Finally, we use loc to set each value of the second column, in the second and third rows, equal to 12. The row key we use for indexing will be ['r2','r3'], while the column key will be 'c2'.

Set df.loc, indexed with the specified row and column keys, equal to 12.

In [171]:
df

    c1  c2  c3
r1   1   0   7
r2   2   5   8
r3   3   0   9


In [173]:
df.loc[['r2', 'r3'], 'c2'] = 12
df

    c1  c2  c3
r1   1   0   7
r2   2  12   8
r3   3  12   9
