# Day 3 -- analysis

0. Your questions 
1. Sorting our data
2. Grouping
3. Multi-indexes and grouping
4. Pivot tables
5. Cleaning our data
6. Plotting and visualization
7. What's next?

# Sorting our data

One of the most important topics in computer science is sorting.

Do we really need to sort our data in Pandas?  Maybe.

But sorting helps if we want the 5 largest values, or the 10 smallest values.

There are multiple ways to sort:

- Sort by the index
- Sort by a column (if it's a data frame)
- Sort by multiple columns (if it's a data frame)

In [1]:
import pandas as pd
from pandas import Series, DataFrame

In [2]:
import numpy as np

In [3]:
np.random.seed(0)   # this resets the random number generator to be deterministic
s = Series(np.random.randint(-50, 50, 10))
s

0    -6
1    -3
2    14
3    17
4    17
5   -41
6    33
7   -29
8   -14
9    37
dtype: int64

In [4]:
# These are sorted by index -- because we created the series with the default index
# I want to sort these by value!

s.sort_values()  # this returns a new series, based on s, with the same index/values, but sorted by value

5   -41
7   -29
8   -14
0    -6
1    -3
2    14
3    17
4    17
6    33
9    37
dtype: int64

In [5]:
s

0    -6
1    -3
2    14
3    17
4    17
5   -41
6    33
7   -29
8   -14
9    37
dtype: int64

In [6]:
# in theory, you could invoke sort_values with the inplace=True keyword argument
# but don't do that!

s = s.sort_values()  # this is the right way to do it!
s

5   -41
7   -29
8   -14
0    -6
1    -3
2    14
3    17
4    17
6    33
9    37
dtype: int64
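
To see why reassigning is the safer habit, here is a minimal sketch (the names `original`, `sorted_copy`, and `scratch` are made up for illustration). It shows that `sort_values` leaves the original series alone, and that with `inplace=True` the method returns `None`, so assigning its result would throw your data away.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
original = pd.Series(np.random.randint(-50, 50, 10))

# sort_values returns a new series; the original keeps its order
sorted_copy = original.sort_values()
print(original.index.tolist())      # still the default index, in order
print(sorted_copy.index.tolist())   # index labels travel with their values

# with inplace=True, the series itself is changed and the method returns None
scratch = original.copy()
result = scratch.sort_values(inplace=True)
print(result)
print(scratch.index.tolist())
```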

In [7]:
# what if I want to sort from biggest to smallest?

s.sort_values(ascending=False)  

9    37
6    33
3    17
4    17
2    14
1    -3
0    -6
8   -14
7   -29
5   -41
dtype: int64

In [8]:
np.random.seed(0)   # this resets the random number generator to be deterministic
s = Series(np.random.randint(-50, 50, 10),
          index=list('fdegabhijc'))
s


f    -6
d    -3
e    14
g    17
a    17
b   -41
h    33
i   -29
j   -14
c    37
dtype: int64

In [9]:
s.sort_values()

b   -41
i   -29
j   -14
f    -6
d    -3
e    14
g    17
a    17
h    33
c    37
dtype: int64

In [10]:
s.sort_index()

a    17
b   -41
c    37
d    -3
e    14
f    -6
g    17
h    33
i   -29
j   -14
dtype: int64

In [11]:
# let's set an index that repeats

np.random.seed(0)   # this resets the random number generator to be deterministic
s = Series(np.random.randint(-50, 50, 10),
          index=list('abcdefabcd'))
s


a    -6
b    -3
c    14
d    17
e    17
f   -41
a    33
b   -29
c   -14
d    37
dtype: int64

In [12]:
s.loc['a']

a    -6
a    33
dtype: int64

In [13]:
s.loc['f']

np.int64(-41)

In [14]:
s.loc['b':'e']

KeyError: "Cannot get left slice bound for non-unique label: 'b'"

In [15]:
# if you sort your index, then this error goes away!

s.sort_index().loc['b':'e']

b    -3
b   -29
c    14
c   -14
d    17
d    37
e    17
dtype: int64
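
If you want to check ahead of time whether a label slice will work, the index can tell you. This sketch uses the `is_unique` and `is_monotonic_increasing` attributes of the index; `s_sorted` is just an illustrative name.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
s = pd.Series(np.random.randint(-50, 50, 10), index=list('abcdefabcd'))

print(s.index.is_unique)                 # False: labels repeat
print(s.index.is_monotonic_increasing)   # False: labels are not in sorted order

s_sorted = s.sort_index()
print(s_sorted.index.is_monotonic_increasing)   # True after sorting

# slicing by label now works, and the end label 'e' is included
print(s_sorted.loc['b':'e'])
```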

In [16]:
help(s.sort_values)

Help on method sort_values in module pandas.core.series:

sort_values(
    *,
    axis: 'Axis' = 0,
    ascending: 'bool | Sequence[bool]' = True,
    inplace: 'bool' = False,
    kind: 'SortKind' = 'quicksort',
    na_position: 'NaPosition' = 'last',
    ignore_index: 'bool' = False,
    key: 'ValueKeyFunc | None' = None
) -> 'Series | None' method of pandas.core.series.Series instance
    Sort by the values.

    Sort a Series in ascending or descending order by some
    criterion.

    Parameters
    ----------
    axis : {0 or 'index'}
        Unused. Parameter needed for compatibility with DataFrame.
    ascending : bool or list of bools, default True
        If True, sort values in ascending order, otherwise descending.
    inplace : bool, default False
        If True, perform operation in-place.
    kind : {'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort'
        Choice of sorting algorithm. See also :func:`numpy.sort` for more
        information. 'mergesor
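
Two of the parameters in that signature, `na_position` and `ignore_index`, are easy to try out. The series below is invented for this sketch; it is not data from the class.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0])

# by default (na_position='last'), NaN goes to the end of the result
default_order = s.sort_values().index.tolist()
print(default_order)

# na_position='first' puts NaN at the start instead
nan_first = s.sort_values(na_position='first').index.tolist()
print(nan_first)

# ignore_index=True throws away the old labels and renumbers from 0
renumbered = s.sort_values(ignore_index=True).index.tolist()
print(renumbered)
```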

In [17]:
s


a    -6
b    -3
c    14
d    17
e    17
f   -41
a    33
b   -29
c   -14
d    37
dtype: int64

In [19]:
s = Series([-10, 10, 20, -20])
s.sort_values(key=abs)   # this means: sort them by absolute value!

0   -10
1    10
2    20
3   -20
dtype: int64
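
The `key` function is called once, on the whole series, and should return a series of the same length; `abs` works because calling `abs` on a series returns a series of absolute values. Here is a small sketch with strings (the fruit names are invented for illustration), sorting without regard to case.

```python
import pandas as pd

names = pd.Series(['banana', 'Cherry', 'apple'])

# default sort: uppercase letters come before lowercase ones
plain = names.sort_values().tolist()
print(plain)

# key gets the entire series, so we can use the vectorized .str methods
caseless = names.sort_values(key=lambda x: x.str.lower()).tolist()
print(caseless)
```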

In [20]:
# what if I want the 3 smallest numbers in s?

np.random.seed(0)   # this resets the random number generator to be deterministic
s = Series(np.random.randint(-50, 50, 10),
          index=list('abcdefabcd'))

s.sort_values().head(3)

f   -41
b   -29
c   -14
dtype: int64

In [21]:
# what about the 3 largest ones?

s.sort_values().tail(3)

e    17
a    33
d    37
dtype: int64

In [22]:
s.sort_values(ascending=False).head(3)

d    37
a    33
d    17
dtype: int64

# What about data frames?

Everything I said is also true for data frames:

- We have `sort_index`
- We have `sort_values`, but we have to specify the column on which we want to sort

In [23]:
np.random.seed(0)
df = DataFrame(np.random.randint(-50, 50, [4, 5]),
              index=list('abcd'),
              columns=list('vwxyz'))
df

    v   w   x   y   z
a  -6  -3  14  17  17
b -41  33 -29 -14  37
c  20  38  38 -38   8
d  15 -11  37  -4  38


In [24]:
df.sort_values('v')   # sort in increasing order by column v

    v   w   x   y   z
b -41  33 -29 -14  37
a  -6  -3  14  17  17
d  15 -11  37  -4  38
c  20  38  38 -38   8


In [25]:
df.sort_values('v', ascending=False)   # sort in decreasing order by column v

    v   w   x   y   z
c  20  38  38 -38   8
d  15 -11  37  -4  38
a  -6  -3  14  17  17
b -41  33 -29 -14  37


In [26]:
df.loc['d', 'w'] = -3
df

    v   w   x   y   z
a  -6  -3  14  17  17
b -41  33 -29 -14  37
c  20  38  38 -38   8
d  15  -3  37  -4  38


In [27]:
df.sort_values('w')

    v   w   x   y   z
a  -6  -3  14  17  17
d  15  -3  37  -4  38
b -41  33 -29 -14  37
c  20  38  38 -38   8


In [28]:
df.sort_values(['w', 'y'])  # it'll sort by w, but if there's a tie, it will use y

    v   w   x   y   z
d  15  -3  37  -4  38
a  -6  -3  14  17  17
b -41  33 -29 -14  37
c  20  38  38 -38   8


In [29]:
df.sort_values(['w', 'y'], ascending=False)

    v   w   x   y   z
c  20  38  38 -38   8
b -41  33 -29 -14  37
a  -6  -3  14  17  17
d  15  -3  37  -4  38


In [30]:
# what if I want w in increasing order, and y in decreasing order?

df.sort_values(['w', 'y'], ascending=[True, False])   # this means: w is ascending, y is descending

    v   w   x   y   z
a  -6  -3  14  17  17
d  15  -3  37  -4  38
b -41  33 -29 -14  37
c  20  38  38 -38   8


# Exercise: Temperatures

1. Create a data frame with two columns, `highs` and `lows`, for the 10-day forecast starting today. The index should contain strings of the form `MMDD`, with a two-digit month and two-digit day.
2. Find the three days with the highest temperatures.
3. Find the three days with the biggest differences between (forecast) high and low temps.
4. Would you, for any of this, need to sort by the index?
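
One way to build the `MMDD` index in step 1 without typing it by hand is `pd.date_range` plus `strftime`. This is only a sketch: the year 2025 in the start date is an assumption made so that the result is reproducible, and the names `start`, `dates`, and `labels` are illustrative.

```python
import pandas as pd

# hypothetical start date; for "starting today" you could use pd.Timestamp.today()
start = pd.Timestamp('2025-05-19')

# ten consecutive days, starting with the start date
dates = pd.date_range(start=start, periods=10, freq='D')

# format each date as a two-digit month followed by a two-digit day
labels = dates.strftime('%m%d')
print(labels.tolist())
```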

In [32]:
df = DataFrame({'highs':[25, 27, 29, 29, 32, 33, 36, 29, 27, 27],
               'lows':[14, 15, 16, 17, 19, 21, 20, 17, 16, 16]},
              index='0519 0520 0521 0522 0523 0524 0525 0526 0527 0528'.split())
df

      highs  lows
0519     25    14
0520     27    15
0521     29    16
0522     29    17
0523     32    19
0524     33    21
0525     36    20
0526     29    17
0527     27    16
0528     27    16


In [34]:
df.sort_values('highs').tail(3)

      highs  lows
0523     32    19
0524     33    21
0525     36    20


In [35]:
df['diffs'] = df['highs'] - df['lows']
df

      highs  lows  diffs
0519     25    14     11
0520     27    15     12
0521     29    16     13
0522     29    17     12
0523     32    19     13
0524     33    21     12
0525     36    20     16
0526     29    17     12
0527     27    16     11
0528     27    16     11


In [37]:
df.sort_values('diffs').tail(3)

      highs  lows  diffs
0521     29    16     13
0523     32    19     13
0525     36    20     16


In [38]:
df = df.sort_values('diffs')

In [39]:
df

      highs  lows  diffs
0519     25    14     11
0527     27    16     11
0528     27    16     11
0520     27    15     12
0522     29    17     12
0524     33    21     12
0526     29    17     12
0521     29    16     13
0523     32    19     13
0525     36    20     16


In [40]:
df.sort_index()

      highs  lows  diffs
0519     25    14     11
0520     27    15     12
0521     29    16     13
0522     29    17     12
0523     32    19     13
0524     33    21     12
0525     36    20     16
0526     29    17     12
0527     27    16     11
0528     27    16     11
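
Sorting the index puts the days back in calendar order only because the labels are zero-padded strings, which are compared character by character. The sketch below uses invented dates (not the forecast data) to show what happens when the padding is missing.

```python
import pandas as pd

# the same three dates: October 2, May 19, January 1
padded = pd.Series([1, 2, 3], index=['1002', '0519', '0101'])
unpadded = pd.Series([1, 2, 3], index=['1002', '519', '101'])

# with two-digit months, string order matches calendar order
padded_order = padded.sort_index().index.tolist()
print(padded_order)

# without the padding, '1002' sorts before '101', and '519' comes last
unpadded_order = unpadded.sort_index().index.tolist()
print(unpadded_order)
```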


In [41]:
# there are two great methods that do this for us!
# nlargest and nsmallest

df.nlargest(columns='diffs', n=3)

      highs  lows  diffs
0525     36    20     16
0521     29    16     13
0523     32    19     13


In [42]:
df.nsmallest(columns='diffs', n=3)

      highs  lows  diffs
0519     25    14     11
0527     27    16     11
0528     27    16     11
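
Ties deserve a little care with these methods. By default, `nlargest` and `nsmallest` return exactly `n` rows, so a tie at the cutoff is broken by whichever row comes first. Passing `keep='all'` returns every row tied for the last place instead. A sketch with the same temperature data (`top2` and `top2_all` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'highs': [25, 27, 29, 29, 32, 33, 36, 29, 27, 27],
                   'lows': [14, 15, 16, 17, 19, 21, 20, 17, 16, 16]},
                  index='0519 0520 0521 0522 0523 0524 0525 0526 0527 0528'.split())
df['diffs'] = df['highs'] - df['lows']

# default keep='first': exactly two rows, even though two days share a diff of 13
top2 = df.nlargest(2, 'diffs')
print(top2)

# keep='all': both days with a diff of 13 are included, so we get three rows
top2_all = df.nlargest(2, 'diffs', keep='all')
print(top2_all)
```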


# Grouping!

Let's redo our temperature data to use days of the week.


In [43]:
df = DataFrame({'highs':[25, 27, 29, 29, 32, 33, 36, 29, 27, 27],
               'lows':[14, 15, 16, 17, 19, 21, 20, 17, 16, 16]},
              index='Mon Tue Wed Thu Fri Sat Sun Mon Tue Wed'.split())
df

     highs  lows
Mon     25    14
Tue     27    15
Wed     29    16
Thu     29    17
Fri     32    19
Sat     33    21
Sun     36    20
Mon     29    17
Tue     27    16
Wed     27    16


In [44]:
# I want to find out the mean high temp on Mondays

df.loc['Mon']

     highs  lows
Mon     25    14
Mon     29    17
