In [2]:
import numpy as np
import pandas as pd

We will start by practising basic operations in Pandas. Becoming familiar with them is important because we will use them repeatedly throughout this course. This introduction covers:
* Attributes of Pandas objects
* Counting values in Series
* Altering labels
* .dt and .str accessors
* Sorting

# Pandas Basics I
One of the great things about widely used Python packages is that they are well documented, so we can usually find out how to do something in Pandas with a quick search. We will also work intensively with the official documentation throughout this module and the course.

## Attributes of Pandas objects

Pandas objects have a number of attributes that give us access to their metadata:

* `shape`: gives the axis dimensions of the object, consistent with a NumPy `ndarray`

* Axis labels:
    * Series: `index` (only axis)
    * DataFrame: `index` (rows) and `columns`
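
As a brief illustration of the attributes listed above, the following sketch builds a small DataFrame and inspects its metadata. The name `frame` and its zero-filled contents are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical 8x3 frame filled with zeros, used only to inspect metadata
frame = pd.DataFrame(np.zeros((8, 3)), columns=["A", "B", "C"])

print(frame.shape)    # (rows, columns) tuple, like ndarray.shape
print(frame.index)    # row labels
print(frame.columns)  # column labels

column = frame["A"]   # a single column is a Series
print(column.shape)   # one-element tuple: a Series has only one axis
print(column.index)   # the same row labels as the DataFrame
```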

In [6]:
df = pd.DataFrame(np.random.randn(8, 3),
                      columns=['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
0,-0.104997,-0.776873,-0.512974
1,0.379473,-1.964233,0.127947
2,0.14527,1.348744,0.897218
3,-0.236888,0.185895,-0.184648
4,0.474092,1.58804,1.143292
5,-0.727644,2.138935,0.783996
6,-1.083042,0.536353,1.725506
7,-0.237614,0.730162,-0.149377


In [5]:
df.columns = [x.lower() for x in df.columns]
df

Unnamed: 0,a,b,c
0,0.004499,0.031106,0.058377
1,-0.68789,-0.048437,-0.38152
2,-0.383353,0.912692,-0.782075
3,2.030198,2.446915,-0.46862
4,-0.634834,0.579532,-0.290633
5,-0.236489,0.372043,-0.728095
6,-0.272154,0.473083,-0.257541
7,1.166249,-0.209682,-2.568596


We can think of the Pandas objects (Index, Series, DataFrame) as containers for arrays, which hold the actual data and do the actual computation. To get the actual data inside an Index or Series, use the attribute `.array`.

In [None]:
df.a.array
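
The following sketch contrasts `.array` with `to_numpy()`, another way to reach the underlying values when a plain NumPy array is wanted. The Series `numbers` is a made-up example.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, 3])  # hypothetical example data

backing = numbers.array         # pandas array wrapping the stored values
as_numpy = numbers.to_numpy()   # plain NumPy ndarray

print(type(backing).__name__)
print(type(as_numpy).__name__)

# The same attribute is available on an Index
print(numbers.index.array)
```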

## Counting values in Series
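The `value_counts()` Series method counts how often each distinct value occurs in a 1D array of values, in effect computing a histogram. The result is sorted by count in descending order. (A top-level `pd.value_counts()` function also exists, but it is deprecated in recent Pandas versions.)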

In [13]:
data = np.random.randint(0, 7, size=50)
data

array([1, 5, 3, 1, 3, 4, 6, 4, 1, 4, 1, 0, 6, 4, 5, 3, 5, 0, 4, 4, 6, 0,
       4, 1, 5, 3, 4, 2, 2, 5, 6, 3, 5, 1, 5, 5, 6, 0, 0, 4, 0, 5, 3, 5,
       3, 4, 1, 6, 0, 5])

In [14]:
s = pd.Series(data)
s.value_counts()  # count occurrences of each distinct value

5    11
4    10
0     7
1     7
3     7
6     6
2     2
dtype: int64
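
A minimal sketch of two common variations, using a small fixed Series (`rolls` is a hypothetical name) so that the results are reproducible: `normalize=True` returns relative frequencies, and `sort_index()` orders the result by value instead of by count.

```python
import pandas as pd

rolls = pd.Series([5, 5, 4, 0, 5, 4])  # hypothetical, fixed data

counts = rolls.value_counts()                # sorted by count, descending
shares = rolls.value_counts(normalize=True)  # relative frequencies summing to 1
by_value = counts.sort_index()               # ordered by the value itself

print(counts)
print(shares)
print(by_value)
```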

Similarly, `mode()` returns the most frequently occurring value(s) in a Series or in each column of a DataFrame. If several values tie for the highest count, all of them are returned.

In [15]:
s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])
s5.mode()

0    3
1    7
dtype: int64

In [18]:
df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
                       "B": np.random.randint(-10, 15, size=50)})

In [19]:
df5.mode()

Unnamed: 0,A,B
0,2,-9


Both `mode()` and `value_counts()` can be called on a Series and on a DataFrame. On a DataFrame, `value_counts()` (available since Pandas 1.1) counts unique rows rather than the values of a single column.

# Altering labels

## Reindexing

```reindex()``` is the fundamental data alignment method in Pandas. It is used to implement nearly all other features that rely on label alignment. To reindex means to conform the data to match a given set of labels along a particular axis. This accomplishes several things:

* Reorders the existing data to match a new set of labels
* Inserts missing value (NA) markers in label locations where no data for that label existed

In [21]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -0.309856
b   -0.065706
c    1.586581
d   -1.418259
e   -0.703846
dtype: float64

In [24]:
# 'f' was not a label in the original Series, so its value is NaN
s.reindex(['e', 'b', 'f', 'd'])

e   -0.703846
b   -0.065706
f         NaN
d   -1.418259
dtype: float64
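
If NaN is not the desired placeholder, `reindex()` accepts a `fill_value` argument. The sketch below uses a made-up Series `scores` to compare the two behaviours.

```python
import pandas as pd

scores = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])  # hypothetical data

with_nan = scores.reindex(["c", "a", "f"])                   # 'f' becomes NaN
with_zero = scores.reindex(["c", "a", "f"], fill_value=0.0)  # 'f' becomes 0.0

print(with_nan)
print(with_zero)
```

In both cases `reindex()` returns a new object and `scores` itself is left unchanged.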

With a DataFrame, we can simultaneously reindex the index and columns:

In [25]:
df = pd.DataFrame({
     'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
     'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
     'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df

Unnamed: 0,one,two,three
a,-0.191469,-0.202602,
b,-1.577498,-0.330737,-0.660448
c,0.59474,0.233008,-1.266313
d,,0.874914,0.226095


In [27]:
df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one']) # reindex both the index and the columns

Unnamed: 0,three,two,one
c,-1.266313,0.233008,0.59474
f,,,
b,-0.660448,-0.330737,-1.577498


We may also use `reindex()` with the `axis` keyword:

In [28]:
df.reindex(['c', 'f', 'b'], axis='index')

Unnamed: 0,one,two,three
c,0.59474,0.233008,-1.266313
f,,,
b,-1.577498,-0.330737,-0.660448


Index objects containing the actual axis labels can be shared between objects. For example, we can reindex a Series using the index of a DataFrame:

In [30]:
rs = s.reindex(df.index)
rs

a   -0.309856
b   -0.065706
c    1.586581
d   -1.418259
dtype: float64

## Dropping labels from an axis

A method closely related to ```reindex()``` is ```drop()```. It removes a set of labels from an axis:

In [32]:
df

Unnamed: 0,one,two,three
a,-0.191469,-0.202602,
b,-1.577498,-0.330737,-0.660448
c,0.59474,0.233008,-1.266313
d,,0.874914,0.226095


In [33]:
df.drop(['a', 'd'], axis=0)

Unnamed: 0,one,two,three
b,-1.577498,-0.330737,-0.660448
c,0.59474,0.233008,-1.266313


In [34]:
df.drop(['one'], axis=1)

Unnamed: 0,two,three
a,-0.202602,
b,-0.330737,-0.660448
c,0.233008,-1.266313
d,0.874914,0.226095
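
As a small sketch of some alternatives, `drop()` also accepts the `index=` and `columns=` keywords in place of `axis`, and `errors="ignore"` skips labels that do not exist. The frame `table` is a made-up example.

```python
import pandas as pd

table = pd.DataFrame({"one": [1, 2], "two": [3, 4]}, index=["a", "b"])  # hypothetical data

without_row = table.drop(index=["a"])      # same as drop(["a"], axis=0)
without_col = table.drop(columns=["one"])  # same as drop(["one"], axis=1)

# errors="ignore" skips missing labels instead of raising a KeyError
unchanged = table.drop(columns=["missing"], errors="ignore")

print(without_row)
print(without_col)
print(table)  # drop() returns a new object; the original is intact
```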


## Renaming

The ```rename()``` method allows us to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

In [36]:
s

a   -0.309856
b   -0.065706
c    1.586581
d   -1.418259
e   -0.703846
dtype: float64

In [37]:
s.rename(str.upper)

A   -0.309856
B   -0.065706
C    1.586581
D   -1.418259
E   -0.703846
dtype: float64

A ```dict``` or ```Series``` can also be used:

In [38]:
df.rename(columns={'one': 'foo', 'two': 'bar'},
              index={'a': 'apple', 'b': 'banana', 'd': 'durian'})

Unnamed: 0,foo,bar,three
apple,-0.191469,-0.202602,
banana,-1.577498,-0.330737,-0.660448
c,0.59474,0.233008,-1.266313
durian,,0.874914,0.226095


```DataFrame.rename()``` also supports an “axis-style” calling convention, where we specify a single mapper and an axis to apply that mapping to.

In [40]:
df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')

Unnamed: 0,foo,bar,three
a,-0.191469,-0.202602,
b,-1.577498,-0.330737,-0.660448
c,0.59474,0.233008,-1.266313
d,,0.874914,0.226095


In [41]:
df.rename({'a': 'apple', 'b': 'banana', 'd': 'durian'}, axis='index')

Unnamed: 0,one,two,three
apple,-0.191469,-0.202602,
banana,-1.577498,-0.330737,-0.660448
c,0.59474,0.233008,-1.266313
durian,,0.874914,0.226095
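
The sketch below, on a made-up frame `table`, shows two details worth knowing: a function can be passed as the mapper for DataFrame columns too, and by default labels in the mapping that are absent from the axis are simply ignored.

```python
import pandas as pd

table = pd.DataFrame({"one": [1, 2], "two": [3, 4]}, index=["a", "b"])  # hypothetical data

upper_cols = table.rename(columns=str.upper)

# 'z' is not a row label, so that part of the mapping has no effect
partial = table.rename(index={"a": "apple", "z": "zebra"})

print(upper_cols)
print(partial)
```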


# ```.dt``` and ```.str``` accessors

## ```.dt```

If a ```Series``` holds datetime-like or period-like values, its `.dt` accessor succinctly returns datetime properties of those values. The result is a Series with the same index as the original.

In [44]:
s = pd.Series(pd.date_range('20130101 09:10:12', periods=4))
s

0   2013-01-01 09:10:12
1   2013-01-02 09:10:12
2   2013-01-03 09:10:12
3   2013-01-04 09:10:12
dtype: datetime64[ns]

In [45]:
s.dt.hour

0    9
1    9
2    9
3    9
dtype: int64

In [46]:
s.dt.second

0    12
1    12
2    12
3    12
dtype: int64

In [47]:
s.dt.day

0    1
1    2
2    3
3    4
dtype: int64

In [48]:
s.dt.dayofweek

0    1
1    2
2    3
3    4
dtype: int64
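
Besides numeric properties, the `.dt` accessor also offers methods. A minimal sketch using the same dates as above (the variable name `stamps` is an assumption for this example):

```python
import pandas as pd

stamps = pd.Series(pd.date_range("20130101 09:10:12", periods=4))

names = stamps.dt.day_name()            # weekday names as strings
dates = stamps.dt.strftime("%Y-%m-%d")  # formatted date strings

print(names)
print(dates)
```

`dayofweek` counts from Monday = 0, so the value 1 for the first row corresponds to a Tuesday.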

We can easily produce timezone-aware transformations:

In [50]:
stz = s.dt.tz_localize('US/Eastern')
stz

0   2013-01-01 09:10:12-05:00
1   2013-01-02 09:10:12-05:00
2   2013-01-03 09:10:12-05:00
3   2013-01-04 09:10:12-05:00
dtype: datetime64[ns, US/Eastern]

In [51]:
stz.dt.tz  # inspect the time zone of the Series

<DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>

In [52]:
s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

0   2013-01-01 04:10:12-05:00
1   2013-01-02 04:10:12-05:00
2   2013-01-03 04:10:12-05:00
3   2013-01-04 04:10:12-05:00
dtype: datetime64[ns, US/Eastern]

## ```.str```

Series is equipped with a set of **string processing methods** that make it easy to operate on each element of the array. These are accessed via the Series’s `str` attribute and generally have names matching the equivalent (scalar) built-in string methods.

In [57]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'],
                  dtype="string")
s

0       A
1       B
2       C
3    Aaba
4    Baca
5    <NA>
6    CABA
7     dog
8     cat
dtype: string

In [56]:
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string

Using the `.str` accessor, we can apply vectorized versions of most standard Python string methods to a Series. Missing values are propagated to the result instead of raising an error.
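
A short sketch of a few more `.str` methods on a made-up Series `words`. Calls can be chained, and the missing entry stays missing in every result.

```python
import pandas as pd

words = pd.Series(["Aaba", "Baca", None, "dog"], dtype="string")  # hypothetical data

lengths = words.str.len()                              # number of characters
starts_with_a = words.str.lower().str.startswith("a")  # chained string methods

print(lengths)
print(starts_with_a)
```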

# Sorting
There are three types of sorting in Pandas:

1. Sorting by index labels
2. Sorting by column values
3. Sorting by a combination of both

## By index

The ```Series.sort_index()``` and ```DataFrame.sort_index()``` methods are used to sort a Pandas object by its index levels.

In [61]:
df = pd.DataFrame({
        'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
        'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
        'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

In [63]:
unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
                          columns=['three', 'two', 'one'])
unsorted_df

Unnamed: 0,three,two,one
a,,-0.302051,-0.093675
d,-1.107556,0.377732,
c,0.021736,0.146348,0.730347
b,1.059159,0.238043,0.84251


In [65]:
# Sort DataFrame by index (ascending by default)
unsorted_df.sort_index()
# Only the last expression in a cell is displayed: descending order
unsorted_df.sort_index(ascending=False)

Unnamed: 0,three,two,one
d,-1.107556,0.377732,
c,0.021736,0.146348,0.730347
b,1.059159,0.238043,0.84251
a,,-0.302051,-0.093675


In [66]:
# Sort DataFrame by column names
unsorted_df.sort_index(axis=1)

Unnamed: 0,one,three,two
a,-0.093675,,-0.302051
d,,-1.107556,0.377732
c,0.730347,0.021736,0.146348
b,0.84251,1.059159,0.238043


In [67]:
# Sort Series by index
unsorted_df['three'].sort_index()

a         NaN
b    1.059159
c    0.021736
d   -1.107556
Name: three, dtype: float64

## By values

The ```Series.sort_values()``` method is used to sort a Series by its values. The ```DataFrame.sort_values()``` method is used to sort a DataFrame by its column or row values.

In [70]:
df1 = pd.DataFrame({'one': [2, 1, 1, 1],
                        'two': [1, 3, 2, 4],
                        'three': [5, 4, 3, 2]})
df1

Unnamed: 0,one,two,three
0,2,1,5
1,1,3,4
2,1,2,3
3,1,4,2


In [71]:
# Sort DataFrame by column "two"
df1.sort_values(by='two')

Unnamed: 0,one,two,three
0,2,1,5
2,1,2,3
1,1,3,4
3,1,4,2


In [72]:
# Sort DataFrame by columns "one" and "two"
df1[['one', 'two', 'three']].sort_values(by=['one', 'two'])

Unnamed: 0,one,two,three
2,1,2,3
1,1,3,4
3,1,4,2
0,2,1,5
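
When sorting by several columns, `ascending` may also be a list with one flag per column. The sketch below rebuilds the same data under the hypothetical name `frame` and sorts "one" ascending and "two" descending.

```python
import pandas as pd

frame = pd.DataFrame({"one": [2, 1, 1, 1],
                      "two": [1, 3, 2, 4],
                      "three": [5, 4, 3, 2]})

# "one" ascending, ties broken by "two" descending
mixed = frame.sort_values(by=["one", "two"], ascending=[True, False])
print(mixed)
```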


These methods treat NA values specially via the `na_position` argument, which places them last by default:

In [73]:
s[2] = np.nan  # s is the string Series from the .str section
s.sort_values()

0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
2    <NA>
5    <NA>
dtype: string

In [74]:
s.sort_values(na_position='first')

2    <NA>
5    <NA>
0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
dtype: string

The `by` parameter of `DataFrame.sort_values()` can refer to either column labels or index level names.

This lets us sort by an index level and a column at once, by passing the name of the index level together with the column name.

In [75]:
# Build MultiIndex
idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
                                   ('b', 2), ('b', 1), ('b', 1)])
idx.names = ['first', 'second']

In [76]:
# Build DataFrame
df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
                            index=idx)
df_multi

first,second,A
a,1,6
a,2,5
a,2,4
b,2,3
b,1,2
b,1,1


In [77]:
# Sort DataFrame by 'second' (index) and 'A' (column)
df_multi.sort_values(by=['second', 'A'])

first,second,A
b,1,1
b,1,2
a,1,6
b,2,3
a,2,4
a,2,5
