# Data Manipulation with pandas

In [1]:
# Setup header
import pandas as pd

pd.set_option('display.max_rows', None)

homelessness = pd.read_csv('homelessness.csv', index_col=0)
sales = pd.read_csv('sales_subset.csv', index_col=0)

sales['date'] = pd.to_datetime(sales['date'])

## Transforming `DataFrame`s

### Inspecting a `DataFrame`

Here are a few ways to summarize pandas `DataFrame`s. `.head()` gives a quick rundown of what the `DataFrame` looks like without any attempt at summarizing the data:

In [2]:
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570.0,864.0,4887681
1,Pacific,Alaska,1434.0,582.0,735139
2,Mountain,Arizona,7259.0,2606.0,7158024
3,West South Central,Arkansas,2280.0,432.0,3009733
4,Pacific,California,109008.0,20964.0,39461588


`.info()` will show the type of each column and give the user an idea of how many missing values there are:

In [3]:
homelessness.info()

<class 'pandas.core.frame.DataFrame'>
Index: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   region          51 non-null     object 
 1   state           51 non-null     object 
 2   individuals     51 non-null     float64
 3   family_members  51 non-null     float64
 4   state_pop       51 non-null     int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 2.4+ KB


`.shape` is a tuple with the number of rows and columns in the `DataFrame`. Note that `.shape` is an attribute rather than a method:

In [4]:
homelessness.shape

(51, 5)

Lastly, `.describe()` gives summary statistics of the *numerical* columns:

In [5]:
homelessness.describe()

Unnamed: 0,individuals,family_members,state_pop
count,51.0,51.0,51.0
mean,7225.784314,3504.882353,6405637.0
std,15991.025083,7805.411811,7327258.0
min,434.0,75.0,577601.0
25%,1446.5,592.0,1777414.0
50%,3082.0,1482.0,4461153.0
75%,6781.5,3196.0,7340946.0
max,109008.0,52070.0,39461590.0


### Parts of a `DataFrame`

`.values` gives a "raw" NumPy array of the content of the `DataFrame`:

In [6]:
homelessness.values

array([['East South Central', 'Alabama', 2570.0, 864.0, 4887681],
       ['Pacific', 'Alaska', 1434.0, 582.0, 735139],
       ['Mountain', 'Arizona', 7259.0, 2606.0, 7158024],
       ['West South Central', 'Arkansas', 2280.0, 432.0, 3009733],
       ['Pacific', 'California', 109008.0, 20964.0, 39461588],
       ['Mountain', 'Colorado', 7607.0, 3250.0, 5691287],
       ['New England', 'Connecticut', 2280.0, 1696.0, 3571520],
       ['South Atlantic', 'Delaware', 708.0, 374.0, 965479],
       ['South Atlantic', 'District of Columbia', 3770.0, 3134.0, 701547],
       ['South Atlantic', 'Florida', 21443.0, 9587.0, 21244317],
       ['South Atlantic', 'Georgia', 6943.0, 2556.0, 10511131],
       ['Pacific', 'Hawaii', 4131.0, 2399.0, 1420593],
       ['Mountain', 'Idaho', 1297.0, 715.0, 1750536],
       ['East North Central', 'Illinois', 6752.0, 3891.0, 12723071],
       ['East North Central', 'Indiana', 3776.0, 1482.0, 6695497],
       ['West North Central', 'Iowa', 1711.0, 1038.0, 3148618]

`.columns` gives a pandas `Index` object for the columns of the `DataFrame`:

In [7]:
homelessness.columns

Index(['region', 'state', 'individuals', 'family_members', 'state_pop'], dtype='object')

`.index` gives an index for the rows, which will consist either of row numbers or row names:

In [8]:
homelessness.index

Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
       36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50],
      dtype='int64')

pandas `Index` objects are a subtle topic that will receive a lot more treatment later.

### Sorting rows

`.sort_values()` with one argument will sort a `DataFrame` by that column, ascending. For example:

In [9]:
homelessness.sort_values('individuals')

Unnamed: 0,region,state,individuals,family_members,state_pop
50,Mountain,Wyoming,434.0,205.0,577601
34,West North Central,North Dakota,467.0,75.0,758080
7,South Atlantic,Delaware,708.0,374.0,965479
39,New England,Rhode Island,747.0,354.0,1058287
45,New England,Vermont,780.0,511.0,624358
29,New England,New Hampshire,835.0,615.0,1353465
41,West North Central,South Dakota,836.0,323.0,878698
26,Mountain,Montana,983.0,422.0,1060665
48,South Atlantic,West Virginia,1021.0,222.0,1804291
24,East South Central,Mississippi,1024.0,328.0,2981020


Descending order can be achieved almost as easily with the `ascending` keyword argument:

In [10]:
homelessness.sort_values('family_members', ascending=False)

Unnamed: 0,region,state,individuals,family_members,state_pop
32,Mid-Atlantic,New York,39827.0,52070.0,19530351
4,Pacific,California,109008.0,20964.0,39461588
21,New England,Massachusetts,6811.0,13257.0,6882635
9,South Atlantic,Florida,21443.0,9587.0,21244317
43,West South Central,Texas,19199.0,6111.0,28628666
47,Pacific,Washington,16424.0,5880.0,7523869
38,Mid-Atlantic,Pennsylvania,8163.0,5349.0,12800922
13,East North Central,Illinois,6752.0,3891.0,12723071
30,Mid-Atlantic,New Jersey,6048.0,3350.0,8886025
37,Pacific,Oregon,11139.0,3337.0,4181886
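
`.sort_values()` also accepts a list of columns, with a matching list of `bool`s for `ascending`. Here is a minimal sketch of that, assuming a small stand-in `DataFrame` (the name `toy` is hypothetical) built from the first five rows shown earlier rather than the full CSV:

```python
import pandas as pd

# First five rows of the homelessness data, copied from the .head() output above
toy = pd.DataFrame({
    'region': ['East South Central', 'Pacific', 'Mountain', 'West South Central', 'Pacific'],
    'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
    'individuals': [2570.0, 1434.0, 7259.0, 2280.0, 109008.0],
    'family_members': [864.0, 582.0, 2606.0, 432.0, 20964.0],
})

# Sort by region A-Z, then by family_members from largest to smallest within each region
toy_sorted = toy.sort_values(['region', 'family_members'], ascending=[True, False])
print(toy_sorted[['region', 'state', 'family_members']])
```

Within the `Pacific` region, California should come before Alaska because the second sort key is descending.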


### Subsetting columns

Individual columns can be extracted as `Series` objects using square bracket indexing:

In [11]:
homelessness['individuals']

0       2570.0
1       1434.0
2       7259.0
3       2280.0
4     109008.0
5       7607.0
6       2280.0
7        708.0
8       3770.0
9      21443.0
10      6943.0
11      4131.0
12      1297.0
13      6752.0
14      3776.0
15      1711.0
16      1443.0
17      2735.0
18      2540.0
19      1450.0
20      4914.0
21      6811.0
22      5209.0
23      3993.0
24      1024.0
25      3776.0
26       983.0
27      1745.0
28      7058.0
29       835.0
30      6048.0
31      1949.0
32     39827.0
33      6451.0
34       467.0
35      6929.0
36      2823.0
37     11139.0
38      8163.0
39       747.0
40      3082.0
41       836.0
42      6139.0
43     19199.0
44      1904.0
45       780.0
46      3928.0
47     16424.0
48      1021.0
49      2740.0
50       434.0
Name: individuals, dtype: float64

`DataFrame`s can be subsetted using a `list` to index the original `DataFrame`:

In [12]:
homelessness[['state', 'family_members']]

Unnamed: 0,state,family_members
0,Alabama,864.0
1,Alaska,582.0
2,Arizona,2606.0
3,Arkansas,432.0
4,California,20964.0
5,Colorado,3250.0
6,Connecticut,1696.0
7,Delaware,374.0
8,District of Columbia,3134.0
9,Florida,9587.0


Note that this approach can be used to subset `DataFrame`s with one column, as opposed to extracting `Series` objects as described above:

In [13]:
homelessness[['state']]

Unnamed: 0,state
0,Alabama
1,Alaska
2,Arizona
3,Arkansas
4,California
5,Colorado
6,Connecticut
7,Delaware
8,District of Columbia
9,Florida


### Subsetting rows

Rows of a `DataFrame` can be subsetted using a `Series` of `bool`s. Here's a simple example:

In [14]:
homelessness[homelessness['individuals'] > 10_000]

Unnamed: 0,region,state,individuals,family_members,state_pop
4,Pacific,California,109008.0,20964.0,39461588
9,South Atlantic,Florida,21443.0,9587.0,21244317
32,Mid-Atlantic,New York,39827.0,52070.0,19530351
37,Pacific,Oregon,11139.0,3337.0,4181886
43,West South Central,Texas,19199.0,6111.0,28628666
47,Pacific,Washington,16424.0,5880.0,7523869


Here's another example:

In [15]:
homelessness[homelessness['region'] == 'Mountain']

Unnamed: 0,region,state,individuals,family_members,state_pop
2,Mountain,Arizona,7259.0,2606.0,7158024
5,Mountain,Colorado,7607.0,3250.0,5691287
12,Mountain,Idaho,1297.0,715.0,1750536
26,Mountain,Montana,983.0,422.0,1060665
28,Mountain,Nevada,7058.0,486.0,3027341
31,Mountain,New Mexico,1949.0,602.0,2092741
44,Mountain,Utah,1904.0,972.0,3153550
50,Mountain,Wyoming,434.0,205.0,577601


Here's a more complicated example using two `Series` of `bool`s joined with a Boolean "and" operation. **Note that each comparison must be wrapped in parentheses and that `&` must be used instead of `and`**:

In [16]:
homelessness[(homelessness['family_members'] < 1000) & (homelessness['region'] == 'Pacific')]

Unnamed: 0,region,state,individuals,family_members,state_pop
1,Pacific,Alaska,1434.0,582.0,735139
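
To make the `&` versus `and` point concrete, here is a small sketch on a stand-in `DataFrame` (the names `toy`, `small` and `pacific` are hypothetical; the rows are the first five from the data above). It also shows `|` and `~`, which are the element-wise "or" and "not":

```python
import pandas as pd

# First five rows of the homelessness data, copied from the .head() output above
toy = pd.DataFrame({
    'region': ['East South Central', 'Pacific', 'Mountain', 'West South Central', 'Pacific'],
    'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
    'family_members': [864.0, 582.0, 2606.0, 432.0, 20964.0],
})

small = toy['family_members'] < 1000
pacific = toy['region'] == 'Pacific'

# `&` (and), `|` (or) and `~` (not) work element by element
both = toy[small & pacific]
either = toy[small | pacific]
not_pacific = toy[~pacific]

# Python's `and` needs a single True/False, which a Series cannot provide
try:
    toy[small and pacific]
    and_failed = False
except ValueError:
    and_failed = True

print(both['state'].tolist(), either['state'].tolist(), and_failed)
```

The parentheses matter when the comparisons are written inline because `&` binds more tightly than `<` and `==`.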


### Subsetting rows by categorical variables

Multiple values of a categorical variable can be matched using the `.isin()` method of pandas `Series` objects:

In [17]:
# Subset the Mojave Desert states
homelessness[homelessness['state'].isin(['California', 'Arizona', 'Nevada', 'Utah'])]

Unnamed: 0,region,state,individuals,family_members,state_pop
2,Mountain,Arizona,7259.0,2606.0,7158024
4,Pacific,California,109008.0,20964.0,39461588
28,Mountain,Nevada,7058.0,486.0,3027341
44,Mountain,Utah,1904.0,972.0,3153550
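
As a rough illustration of what `.isin()` saves, the sketch below compares it with the equivalent chain of `==` tests joined with `|`, and uses `~` to select everything *outside* the list. The `DataFrame` `toy` is a hypothetical stand-in holding only the first five state names:

```python
import pandas as pd

toy = pd.DataFrame({'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California']})
wanted = ['California', 'Arizona', 'Nevada', 'Utah']

mask_isin = toy['state'].isin(wanted)

# The long-hand equivalent: one comparison per value, joined with |
mask_or = (
    (toy['state'] == 'California')
    | (toy['state'] == 'Arizona')
    | (toy['state'] == 'Nevada')
    | (toy['state'] == 'Utah')
)

inside = toy[mask_isin]
outside = toy[~mask_isin]  # rows whose state is NOT in the list
print(inside['state'].tolist(), outside['state'].tolist())
```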


### Adding new columns

Here, a `total` column is added to the `homelessness` `DataFrame`:

In [18]:
homelessness['total'] = homelessness['individuals'] + homelessness['family_members']
homelessness

Unnamed: 0,region,state,individuals,family_members,state_pop,total
0,East South Central,Alabama,2570.0,864.0,4887681,3434.0
1,Pacific,Alaska,1434.0,582.0,735139,2016.0
2,Mountain,Arizona,7259.0,2606.0,7158024,9865.0
3,West South Central,Arkansas,2280.0,432.0,3009733,2712.0
4,Pacific,California,109008.0,20964.0,39461588,129972.0
5,Mountain,Colorado,7607.0,3250.0,5691287,10857.0
6,New England,Connecticut,2280.0,1696.0,3571520,3976.0
7,South Atlantic,Delaware,708.0,374.0,965479,1082.0
8,South Atlantic,District of Columbia,3770.0,3134.0,701547,6904.0
9,South Atlantic,Florida,21443.0,9587.0,21244317,31030.0


And here, a `p_homeless` column is added to give *per capita* numbers:

In [19]:
homelessness['p_homeless'] = homelessness['total'] / homelessness['state_pop']
homelessness

Unnamed: 0,region,state,individuals,family_members,state_pop,total,p_homeless
0,East South Central,Alabama,2570.0,864.0,4887681,3434.0,0.000703
1,Pacific,Alaska,1434.0,582.0,735139,2016.0,0.002742
2,Mountain,Arizona,7259.0,2606.0,7158024,9865.0,0.001378
3,West South Central,Arkansas,2280.0,432.0,3009733,2712.0,0.000901
4,Pacific,California,109008.0,20964.0,39461588,129972.0,0.003294
5,Mountain,Colorado,7607.0,3250.0,5691287,10857.0,0.001908
6,New England,Connecticut,2280.0,1696.0,3571520,3976.0,0.001113
7,South Atlantic,Delaware,708.0,374.0,965479,1082.0,0.001121
8,South Atlantic,District of Columbia,3770.0,3134.0,701547,6904.0,0.009841
9,South Atlantic,Florida,21443.0,9587.0,21244317,31030.0,0.001461


### Combo-attack!

The DataCamp instructor's whimsical name for an exercise combining the following operations:

* Adding a column `indiv_per_10k`, containing the number of homeless individuals in each state per 10,000 members of the population
* Subsetting the rows where `indiv_per_10k` is higher than 20 (`high_homelessness`)
* Sorting `high_homelessness` by `indiv_per_10k` in descending order (`high_homelessness_srt`)
* Selecting only the `state` and `indiv_per_10k` columns of `high_homelessness_srt` and saving that subset as `result`

In [20]:
# Remove columns from previous exercise
homelessness = homelessness.drop(['total', 'p_homeless'], axis=1)

homelessness['indiv_per_10k'] = 10000 * homelessness['individuals'] / homelessness['state_pop']
homelessness

Unnamed: 0,region,state,individuals,family_members,state_pop,indiv_per_10k
0,East South Central,Alabama,2570.0,864.0,4887681,5.258117
1,Pacific,Alaska,1434.0,582.0,735139,19.506515
2,Mountain,Arizona,7259.0,2606.0,7158024,10.141067
3,West South Central,Arkansas,2280.0,432.0,3009733,7.575423
4,Pacific,California,109008.0,20964.0,39461588,27.623825
5,Mountain,Colorado,7607.0,3250.0,5691287,13.366045
6,New England,Connecticut,2280.0,1696.0,3571520,6.383837
7,South Atlantic,Delaware,708.0,374.0,965479,7.333148
8,South Atlantic,District of Columbia,3770.0,3134.0,701547,53.738381
9,South Atlantic,Florida,21443.0,9587.0,21244317,10.093523


In [21]:
high_homelessness = homelessness[homelessness['indiv_per_10k'] > 20]
high_homelessness

Unnamed: 0,region,state,individuals,family_members,state_pop,indiv_per_10k
4,Pacific,California,109008.0,20964.0,39461588,27.623825
8,South Atlantic,District of Columbia,3770.0,3134.0,701547,53.738381
11,Pacific,Hawaii,4131.0,2399.0,1420593,29.079406
28,Mountain,Nevada,7058.0,486.0,3027341,23.314189
32,Mid-Atlantic,New York,39827.0,52070.0,19530351,20.392363
37,Pacific,Oregon,11139.0,3337.0,4181886,26.636307
47,Pacific,Washington,16424.0,5880.0,7523869,21.829195


In [22]:
high_homelessness_srt = high_homelessness.sort_values('indiv_per_10k', ascending=False)
high_homelessness_srt

Unnamed: 0,region,state,individuals,family_members,state_pop,indiv_per_10k
8,South Atlantic,District of Columbia,3770.0,3134.0,701547,53.738381
11,Pacific,Hawaii,4131.0,2399.0,1420593,29.079406
4,Pacific,California,109008.0,20964.0,39461588,27.623825
37,Pacific,Oregon,11139.0,3337.0,4181886,26.636307
28,Mountain,Nevada,7058.0,486.0,3027341,23.314189
47,Pacific,Washington,16424.0,5880.0,7523869,21.829195
32,Mid-Atlantic,New York,39827.0,52070.0,19530351,20.392363


In [23]:
result = high_homelessness_srt[['state', 'indiv_per_10k']]
result

Unnamed: 0,state,indiv_per_10k
8,District of Columbia,53.738381
11,Hawaii,29.079406
4,California,27.623825
37,Oregon,26.636307
28,Nevada,23.314189
47,Washington,21.829195
32,New York,20.392363
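
The four steps can also be written as one method chain. This is only a sketch of an alternative style, not the course's solution: `.assign()` and `.query()` are swapped in for the column assignment and Boolean subsetting used above, and `toy` is a hypothetical stand-in built from five rows of the outputs above.

```python
import pandas as pd

# Rows copied from the outputs above (state, individuals, state_pop)
toy = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'California', 'District of Columbia', 'Hawaii'],
    'individuals': [2570.0, 1434.0, 109008.0, 3770.0, 4131.0],
    'state_pop': [4887681, 735139, 39461588, 701547, 1420593],
})

result_chained = (
    toy
    # add the per-10k column without modifying `toy` itself
    .assign(indiv_per_10k=lambda df: 10000 * df['individuals'] / df['state_pop'])
    # keep only the rows above the threshold
    .query('indiv_per_10k > 20')
    # largest values first
    .sort_values('indiv_per_10k', ascending=False)
    # keep just the two columns of interest
    [['state', 'indiv_per_10k']]
)
print(result_chained)
```

One difference from the cell-by-cell version is that the chain leaves the original `DataFrame` untouched, so there are no helper columns to drop afterwards.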


## Aggregating `DataFrame`s

### Mean and median

First, let's have a quick overview of the `sales` `DataFrame`:

In [24]:
sales.head()

Unnamed: 0,store,type,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment
0,1,A,1,2010-02-05,24924.5,False,5.727778,0.679451,8.106
1,1,A,1,2010-03-05,21827.9,False,8.055556,0.693452,8.106
2,1,A,1,2010-04-02,57258.43,False,16.816667,0.718284,7.808
3,1,A,1,2010-05-07,17413.94,False,22.527778,0.748928,7.808
4,1,A,1,2010-06-04,17558.09,False,27.05,0.714586,7.808


Then let's take a look at some "info" for each of the columns:

In [25]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10774 entries, 0 to 10773
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   store                 10774 non-null  int64         
 1   type                  10774 non-null  object        
 2   department            10774 non-null  int64         
 3   date                  10774 non-null  datetime64[ns]
 4   weekly_sales          10774 non-null  float64       
 5   is_holiday            10774 non-null  bool          
 6   temperature_c         10774 non-null  float64       
 7   fuel_price_usd_per_l  10774 non-null  float64       
 8   unemployment          10774 non-null  float64       
dtypes: bool(1), datetime64[ns](1), float64(4), int64(2), object(1)
memory usage: 768.1+ KB


Now for some aggregation as promised. The mean of `weekly_sales`:

In [26]:
float(sales['weekly_sales'].mean())

23843.95014850566

And the median:

In [27]:
float(sales['weekly_sales'].median())

12049.064999999999

### Summarizing dates

Aggregate statistics on dates are also possible. DataCamp does some voodoo in the background of their lesson for this example, which I've replicated in the notebook header by calling `pd.to_datetime()`:

In [28]:
sales['date'].max()

Timestamp('2012-10-26 00:00:00')

In [29]:
sales['date'].min()

Timestamp('2010-02-05 00:00:00')

### Efficient summaries

Custom aggregation functions are also possible to go beyond built-in functions. Here's an example with interquartile range:

In [30]:
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)


float(sales['temperature_c'].agg(iqr))

16.583333333333336

Aggregation can also be applied to a `DataFrame` with multiple columns rather than just one `Series`:

In [31]:
sales[['temperature_c', 'fuel_price_usd_per_l', 'unemployment']].agg(iqr)

temperature_c           16.583333
fuel_price_usd_per_l     0.073176
unemployment             0.565000
dtype: float64

In the same way, multiple functions can be given to `.agg()` (the DataCamp code used `np.median` but `pd.Series.median` doesn't raise a warning):

In [32]:
sales[['temperature_c', 'fuel_price_usd_per_l', 'unemployment']].agg([iqr, pd.Series.median])

Unnamed: 0,temperature_c,fuel_price_usd_per_l,unemployment
iqr,16.583333,0.073176,0.565
median,16.966667,0.743381,8.099
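
Built-in aggregations can also be passed to `.agg()` by name as strings, mixed freely with custom functions. A minimal sketch, assuming a stand-in `DataFrame` (`toy_sales`, a hypothetical name) made from the first five rows of `sales` shown earlier:

```python
import pandas as pd

# temperature_c and unemployment from the first five rows of sales.head()
toy_sales = pd.DataFrame({
    'temperature_c': [5.727778, 8.055556, 16.816667, 22.527778, 27.05],
    'unemployment': [8.106, 8.106, 7.808, 7.808, 7.808],
})


def iqr(column):
    # Interquartile range: 75th percentile minus 25th percentile
    return column.quantile(0.75) - column.quantile(0.25)


# A custom function and a string name in the same list
summary = toy_sales.agg([iqr, 'median'])
print(summary)
```

The row labels of the result come from the function's name (`iqr`) and from the string (`median`).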


### Cumulative statistics

To start with, we'll need a subset of `sales` that contains only rows where `store` is `1` and `department` is also `1`:

In [33]:
sales_1_1 = sales[(sales['store'] == 1) & (sales['department'] == 1)]
sales_1_1.head()

Unnamed: 0,store,type,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment
0,1,A,1,2010-02-05,24924.5,False,5.727778,0.679451,8.106
1,1,A,1,2010-03-05,21827.9,False,8.055556,0.693452,8.106
2,1,A,1,2010-04-02,57258.43,False,16.816667,0.718284,7.808
3,1,A,1,2010-05-07,17413.94,False,22.527778,0.748928,7.808
4,1,A,1,2010-06-04,17558.09,False,27.05,0.714586,7.808


Cumulative statistics will also require that `sales_1_1` be sorted by `date` ascending. The data are already in order, but it can't hurt to be thorough:

In [34]:
sales_1_1 = sales_1_1.sort_values('date')
sales_1_1.head()

Unnamed: 0,store,type,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment
0,1,A,1,2010-02-05,24924.5,False,5.727778,0.679451,8.106
1,1,A,1,2010-03-05,21827.9,False,8.055556,0.693452,8.106
2,1,A,1,2010-04-02,57258.43,False,16.816667,0.718284,7.808
3,1,A,1,2010-05-07,17413.94,False,22.527778,0.748928,7.808
4,1,A,1,2010-06-04,17558.09,False,27.05,0.714586,7.808


Now we'll add cumulative sum and cumulative maximum columns drawn from `weekly_sales`:

In [35]:
sales_1_1['cum_weekly_sales'] = sales_1_1['weekly_sales'].cumsum()
sales_1_1['cum_max_sales'] = sales_1_1['weekly_sales'].cummax()
sales_1_1[['date', 'weekly_sales', 'cum_weekly_sales', 'cum_max_sales']]

Unnamed: 0,date,weekly_sales,cum_weekly_sales,cum_max_sales
0,2010-02-05,24924.5,24924.5,24924.5
1,2010-03-05,21827.9,46752.4,24924.5
2,2010-04-02,57258.43,104010.83,57258.43
3,2010-05-07,17413.94,121424.77,57258.43
4,2010-06-04,17558.09,138982.86,57258.43
5,2010-07-02,16333.14,155316.0,57258.43
6,2010-08-06,17508.41,172824.41,57258.43
7,2010-09-03,16241.78,189066.19,57258.43
8,2010-10-01,20094.19,209160.38,57258.43
9,2010-11-05,34238.88,243399.26,57258.43


### Dropping duplicates

Removing duplicates can be an important step in analyzing categorical data. Here, `store_types` and `store_depts` hold one row for each unique combination of `store` and `type`, and of `store` and `department`, respectively, in the `sales` dataset:

In [36]:
store_types = sales.drop_duplicates(['store', 'type'])
store_types

Unnamed: 0,store,type,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment
0,1,A,1,2010-02-05,24924.5,False,5.727778,0.679451,8.106
901,2,A,1,2010-02-05,35034.06,False,4.55,0.679451,8.324
1798,4,A,1,2010-02-05,38724.42,False,6.533333,0.686319,8.623
2699,6,A,1,2010-02-05,25619.0,False,4.683333,0.679451,7.259
3593,10,B,1,2010-02-05,40212.84,False,12.411111,0.782478,9.765
4495,13,A,1,2010-02-05,46761.9,False,-0.261111,0.704283,8.316
5408,14,A,1,2010-02-05,32842.31,False,-2.605556,0.735455,8.992
6293,19,A,1,2010-02-05,21500.58,False,-6.133333,0.780365,8.35
7199,20,A,1,2010-02-05,46021.21,False,-3.377778,0.735455,8.187
8109,27,A,1,2010-02-05,32313.79,False,-2.672222,0.780365,8.237


In [37]:
store_depts = sales.drop_duplicates(['store', 'department'])
store_depts

Unnamed: 0,store,type,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment
0,1,A,1,2010-02-05,24924.5,False,5.727778,0.679451,8.106
12,1,A,2,2010-02-05,50605.27,False,5.727778,0.679451,8.106
24,1,A,3,2010-02-05,13740.12,False,5.727778,0.679451,8.106
36,1,A,4,2010-02-05,39954.04,False,5.727778,0.679451,8.106
48,1,A,5,2010-02-05,32229.38,False,5.727778,0.679451,8.106
60,1,A,6,2010-02-05,5749.03,False,5.727778,0.679451,8.106
72,1,A,7,2010-02-05,21084.08,False,5.727778,0.679451,8.106
84,1,A,8,2010-02-05,40129.01,False,5.727778,0.679451,8.106
96,1,A,9,2010-02-05,16930.99,False,5.727778,0.679451,8.106
108,1,A,10,2010-02-05,30721.5,False,5.727778,0.679451,8.106


Perhaps a more interesting/obvious transformation is getting all the unique holiday dates from the `sales` dataset:

In [38]:
holiday_dates = sales[sales['is_holiday']].drop_duplicates('date').sort_values('date')
holiday_dates['date']

2315   2010-02-12
498    2010-09-10
6810   2010-12-31
6820   2011-09-09
691    2011-11-25
6815   2012-02-10
6735   2012-09-07
Name: date, dtype: datetime64[ns]
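
It may help to see which row survives de-duplication. By default `.drop_duplicates()` keeps the *first* row of each combination, which is why every row of `store_types` above is dated 2010-02-05. The sketch below uses a hypothetical four-row stand-in (`toy_sales`) with values copied from the outputs above:

```python
import pandas as pd

# Four rows copied from the outputs above
toy_sales = pd.DataFrame({
    'store': [1, 1, 1, 2],
    'type': ['A', 'A', 'A', 'A'],
    'department': [1, 1, 2, 1],
    'weekly_sales': [24924.5, 21827.9, 50605.27, 35034.06],
})

# Default keep='first': the earliest row of each (store, type) pair survives
first_rows = toy_sales.drop_duplicates(['store', 'type'])

# keep='last' retains the final row of each pair instead
last_rows = toy_sales.drop_duplicates(['store', 'type'], keep='last')

print(first_rows.index.tolist())  # [0, 3]
print(last_rows.index.tolist())   # [2, 3]
```

Note that the other columns (here `department` and `weekly_sales`) simply come along with whichever row is kept; they are not aggregated.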

### Counting categorical variables

Here are a few ways that categorical data can be counted, based on the two de-duplicated `DataFrame`s calculated previously:

In [39]:
store_counts = store_types['type'].value_counts()
store_counts

type
A    11
B     1
Name: count, dtype: int64

`normalize=True` does proportions rather than raw counts:

In [40]:
store_props = store_types['type'].value_counts(normalize=True)
store_props

type
A    0.916667
B    0.083333
Name: proportion, dtype: float64

`sort=True` (the default, at least as of pandas 2.2.3) sorts the result by count in descending order:

In [41]:
dept_counts_sorted = store_depts['department'].value_counts(sort=True)
dept_counts_sorted

department
1     12
2     12
3     12
4     12
5     12
6     12
7     12
8     12
9     12
10    12
11    12
12    12
13    12
14    12
16    12
17    12
18    12
19    12
20    12
21    12
22    12
23    12
24    12
25    12
26    12
27    12
28    12
29    12
30    12
31    12
32    12
33    12
34    12
35    12
36    12
38    12
41    12
40    12
42    12
44    12
49    12
45    12
46    12
47    12
52    12
51    12
55    12
54    12
83    12
85    12
56    12
58    12
59    12
60    12
67    12
71    12
72    12
74    12
77    12
78    12
79    12
80    12
81    12
82    12
95    12
96    12
87    12
90    12
91    12
92    12
93    12
94    12
97    12
98    12
99    11
37    10
48     8
50     6
39     4
43     2
Name: count, dtype: int64

`sort` and `normalize` can of course be combined:

In [42]:
dept_props_sorted = store_depts['department'].value_counts(sort=True, normalize=True)
dept_props_sorted

department
1     0.012917
2     0.012917
3     0.012917
4     0.012917
5     0.012917
6     0.012917
7     0.012917
8     0.012917
9     0.012917
10    0.012917
11    0.012917
12    0.012917
13    0.012917
14    0.012917
16    0.012917
17    0.012917
18    0.012917
19    0.012917
20    0.012917
21    0.012917
22    0.012917
23    0.012917
24    0.012917
25    0.012917
26    0.012917
27    0.012917
28    0.012917
29    0.012917
30    0.012917
31    0.012917
32    0.012917
33    0.012917
34    0.012917
35    0.012917
36    0.012917
38    0.012917
41    0.012917
40    0.012917
42    0.012917
44    0.012917
49    0.012917
45    0.012917
46    0.012917
47    0.012917
52    0.012917
51    0.012917
55    0.012917
54    0.012917
83    0.012917
85    0.012917
56    0.012917
58    0.012917
59    0.012917
60    0.012917
67    0.012917
71    0.012917
72    0.012917
74    0.012917
77    0.012917
78    0.012917
79    0.012917
80    0.012917
81    0.012917
82    0.012917
95    0.012917
96    0.012917

### What percent of sales occurred at each store type?

Here it is possible to do a summary for each of the three types of store manually, then normalize using the sum of all values of `weekly_sales`. I was initially confused that a `list` could be divided by what one might call a "scalar" value, to borrow from Perl terminology. It works because `sales_all` is an `np.float64`: a `list` has no division of its own, so Python hands the operation to NumPy, which converts the `list` to an array and divides element-wise. That is also why the result is displayed as an `array`.

In [43]:
sales_all = sales['weekly_sales'].sum()
sales_A = sales[sales['type'] == "A"]['weekly_sales'].sum()
sales_B = sales[sales['type'] == "B"]['weekly_sales'].sum()
sales_C = sales[sales['type'] == "C"]['weekly_sales'].sum()

sales_propn_by_type = [sales_A, sales_B, sales_C] / sales_all
sales_propn_by_type

array([0.9098, 0.0902, 0.    ])
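
The `list`-divided-by-number behavior can be checked in isolation. The sketch below uses made-up numbers (`parts` and `total` are hypothetical names) to show that the division succeeds when the divisor is an `np.float64` and fails when it is a plain Python `float`:

```python
import numpy as np

# Made-up values, chosen only so the shares are easy to read
total = np.float64(10.0)
parts = [2.5, 7.5, 0.0]

# A list cannot divide itself, so Python falls back to NumPy's reflected
# division, which converts the list to an array and divides element by element
shares = parts / total
print(type(shares).__name__, shares)

# With a plain Python float there is no such fallback
try:
    parts / 10.0
    plain_float_failed = False
except TypeError:
    plain_float_failed = True

print(plain_float_failed)
```

Wrapping the list explicitly, as in `np.array(parts) / total`, gives the same result and makes the conversion visible to the reader.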

### Calculations with .groupby()

Strictly speaking, the above *works*, but it is brittle (values of `type` are hard-coded), tedious, and error-prone (a violation of the DRY principle). Writing a `for` loop would obviate many of the downsides of the above approach, but pandas can do us one better by offering `.groupby()`:

In [44]:
sales_by_type = sales.groupby('type')['weekly_sales'].sum()
sales_by_type

type
A    2.337163e+08
B    2.317840e+07
Name: weekly_sales, dtype: float64

In [45]:
sales_propn_by_type = sales_by_type / sales['weekly_sales'].sum()
sales_propn_by_type

type
A    0.909775
B    0.090225
Name: weekly_sales, dtype: float64
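
For comparison, here is a sketch of the `for` loop alternative next to `.groupby()`, on a hypothetical four-row stand-in (`toy_sales`) whose values are copied from the `store_types` output above. Both give the same totals; the `.groupby()` version is simply shorter and returns a labeled `Series`:

```python
import pandas as pd

# type and weekly_sales for stores 1, 2, 10 and 4, copied from the outputs above
toy_sales = pd.DataFrame({
    'type': ['A', 'A', 'B', 'A'],
    'weekly_sales': [24924.5, 35034.06, 40212.84, 38724.42],
})

# Loop version: no hard-coded type values, but still manual bookkeeping
totals_loop = {}
for store_type in toy_sales['type'].unique():
    rows = toy_sales[toy_sales['type'] == store_type]
    totals_loop[store_type] = rows['weekly_sales'].sum()

# groupby version: split by type, sum each group, combine into one Series
totals_groupby = toy_sales.groupby('type')['weekly_sales'].sum()

# Proportions follow by dividing by the grand total
props = totals_groupby / totals_groupby.sum()
print(totals_loop)
print(props)
```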

### Multiple grouped summaries

Moreover, it is possible to compute multiple summaries after a `.groupby()` operation. Here we group by `type` and compute several aggregate statistics on `weekly_sales` as an example of summarizing only one variable:

In [46]:
sales_stats = sales.groupby('type')['weekly_sales'].agg([pd.Series.min, pd.Series.max, pd.Series.mean, pd.Series.median])
sales_stats

type,min,max,mean,median
A,-1098.0,293966.05,23674.667242,11943.92
B,-798.0,232558.51,25696.67837,13336.08


In this instance, however, the aggregation is done over `unemployment` and `fuel_price_usd_per_l`:

In [47]:
unemp_fuel_stats = sales.groupby('type')[['unemployment', 'fuel_price_usd_per_l']].agg([
    pd.Series.min,
    pd.Series.max,
    pd.Series.mean,
    pd.Series.median
])
unemp_fuel_stats

,unemployment,unemployment,unemployment,unemployment,fuel_price_usd_per_l,fuel_price_usd_per_l,fuel_price_usd_per_l,fuel_price_usd_per_l
type,min,max,mean,median,min,max,mean,median
A,3.879,8.992,7.972611,8.067,0.664129,1.10741,0.744619,0.735455
B,7.17,9.765,9.279323,9.199,0.760023,1.107674,0.805858,0.803348
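
When different columns need different statistics, the nested column headers above can get unwieldy. One option is pandas' named aggregation, where each keyword argument becomes an output column and each value is a `(column, function)` pair. This is a sketch on a hypothetical stand-in (`toy_sales`, with values copied from the `store_types` output), and the output column names are my own choices:

```python
import pandas as pd

# type, weekly_sales and unemployment for stores 1, 2, 10 and 4
toy_sales = pd.DataFrame({
    'type': ['A', 'A', 'B', 'A'],
    'weekly_sales': [24924.5, 35034.06, 40212.84, 38724.42],
    'unemployment': [8.106, 8.324, 9.765, 8.623],
})

# Each keyword names an output column; each tuple is (input column, aggregation)
stats = toy_sales.groupby('type').agg(
    sales_mean=('weekly_sales', 'mean'),
    sales_max=('weekly_sales', 'max'),
    unemp_median=('unemployment', 'median'),
)
print(stats)
```

The result has a single, flat level of column names, which tends to be easier to subset later.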


### Pivoting on one variable

pandas can even emulate the "pivot table" functionality of many spreadsheet programs. Here, `index` names the variable to group by and `values` names the variable to aggregate after grouping (`weekly_sales`). Note that the default aggregation is the arithmetic mean:

In [48]:
mean_sales_by_type = sales.pivot_table(values='weekly_sales', index='type')
mean_sales_by_type

type,weekly_sales
A,23674.667242
B,25696.67837


And here, two aggregation functions are explicitly applied. They can be named with strings instead of with explicit method names, which is nice:

In [49]:
mean_med_sales_by_type = sales.pivot_table(values='weekly_sales', index='type', aggfunc=['mean', 'median'])
mean_med_sales_by_type

,mean,median
type,weekly_sales,weekly_sales
A,23674.667242,11943.92
B,25696.67837,13336.08


A second dimension can be introduced with the addition of the `columns` argument:

In [50]:
mean_sales_by_type_holiday = sales.pivot_table(values='weekly_sales', index='type', columns='is_holiday')
mean_sales_by_type_holiday

is_holiday,False,True
type,,
A,23768.583523,590.04525
B,25751.980533,810.705
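
A two-dimensional pivot table is closely related to grouping by two variables. The sketch below builds the same table both ways on made-up numbers (chosen so the means are easy to check by hand; `toy_sales` is a hypothetical name), using `.unstack()` to move the inner grouping level into the columns:

```python
import pandas as pd

# Made-up numbers chosen so the means are easy to check by hand
toy_sales = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B'],
    'is_holiday': [False, False, True, False, False],
    'weekly_sales': [100.0, 300.0, 50.0, 400.0, 600.0],
})

pivot = toy_sales.pivot_table(values='weekly_sales', index='type', columns='is_holiday')

# The same table built by hand: group on both variables, average, then
# move the inner level (is_holiday) from the rows to the columns
grouped = toy_sales.groupby(['type', 'is_holiday'])['weekly_sales'].mean().unstack()

print(pivot)
print(grouped)
```

Type `B` has no holiday rows in this toy data, so that cell comes out as `NaN` in both versions.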


### Fill in missing values and sum values with pivot tables

Rather than just have a bunch of `NaN` values, `fill_value` can be set to `0` when appropriate:

In [51]:
sales.pivot_table(values='weekly_sales', index='department', columns='type', fill_value=0)

type,A,B
department,,
1,30961.725379,44050.626667
2,67600.158788,112958.526667
3,17160.002955,30580.655
4,44285.399091,51219.654167
5,34821.011364,63236.875
6,7136.292652,10717.2975
7,38454.336818,52909.653333
8,48583.475303,90733.753333
9,30120.449924,66679.301667
10,30930.456364,48595.126667


`.pivot_table()` can also calculate partial group aggregates on the margins with `margins=True` (again the default aggregate function is the arithmetic mean):

In [53]:
sales.pivot_table(values='weekly_sales', index='department', columns='type', fill_value=0, margins=True)

type,A,B,All
department,,,
1,30961.725379,44050.626667,32052.467153
2,67600.158788,112958.526667,71380.022778
3,17160.002955,30580.655,18278.390625
4,44285.399091,51219.654167,44863.253681
5,34821.011364,63236.875,37189.0
6,7136.292652,10717.2975,7434.709722
7,38454.336818,52909.653333,39658.946528
8,48583.475303,90733.753333,52095.998472
9,30120.449924,66679.301667,33167.020903
10,30930.456364,48595.126667,32402.512222
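
One detail worth checking by hand is what the `All` margin contains. In the sketch below (made-up numbers, hypothetical name `toy_sales`), the `All` value for a row is the mean over *all of that row's underlying observations*, not the mean of the per-type means, and cells filled in by `fill_value` do not count toward it:

```python
import pandas as pd

# Made-up numbers: department 1 has two type-A rows and one type-B row,
# department 2 has a single type-A row and no type-B rows at all
toy_sales = pd.DataFrame({
    'department': [1, 1, 1, 2],
    'type': ['A', 'A', 'B', 'A'],
    'weekly_sales': [100.0, 300.0, 500.0, 40.0],
})

table = toy_sales.pivot_table(
    values='weekly_sales',
    index='department',
    columns='type',
    fill_value=0,    # the empty (2, B) cell shows 0 instead of NaN
    margins=True,    # adds an 'All' row and an 'All' column
)
print(table)
```

For department 1 the `A` mean is 200 and the `B` mean is 500, yet `All` is 300, the mean of 100, 300 and 500. This matches the output above, where each `All` value sits much closer to the `A` column because most rows in `sales` are type `A`.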


## Slicing and Indexing `DataFrame`s