# Manipulating DataFrames with pandas

- Extracting, filtering, and transforming data from DataFrames
- Advanced indexing with multiple levels
- Tidying, rearranging and restructuring your data
- Pivoting, melting, and stacking DataFrames
- Identifying and splitting DataFrames by groups

In [12]:
import pandas as pd

df = pd.read_csv('sampledata.csv')

df.head()

Unnamed: 0,ITEM_NO,AMC Item Number,ITEM_DESC,Logic Count,Seafrigo Count,Discrepancy (cases),Discrepancy Value,ACTIVE/INACTIVE
0,44207509,L044207509,"DRAWER, DELTA VARIO ATLAS 120MM",1335,2235,900,"$25,945.19",False
1,44207668,L044207668,"TOILET PAPER, COTTENELLE",445,985,540,"$17,813.36",False
2,44207958,L044207958,"LINER, TRASH CAN, DOM SM",259,759,500,"$14,445.00",False
3,44208068,L044208068,"BLANKET, YC (G&G)",371,811,440,"$94,467.70",False
4,44207583,L044207583,"FOAMING SOAP, SQUARE M+G 5.4 oz",523,885,362,"$63,712.00",False


In [6]:
df.columns

Index([u'ITEM_NO', u' AMC Item Number ', u'ITEM_DESC', u'Logic Count',
       u'Seafrigo Count', u'Discrepancy (cases)', u' Discrepancy Value ',
       u'ACTIVE/INACTIVE'],
      dtype='object')

### Index ordering

In [13]:
df.loc[1, 'ITEM_NO']

'44207668'

### Positional and labeled indexing

Given a pair of label-based indices, sometimes it's necessary to find the corresponding positions.

In [15]:
print(df.iloc[3, 4])

811
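
To go from labels to positions, the `get_loc()` method of an Index can be used on both the row index and the columns. The sketch below uses a hypothetical three-row miniature of the inventory table (the `inventory` name and its contents are assumptions for illustration, with values copied from the `head()` output above).

```python
import pandas as pd

# Hypothetical miniature of the inventory table
inventory = pd.DataFrame(
    {"Logic Count": [1335, 445, 259], "Seafrigo Count": [2235, 985, 759]},
    index=pd.Index(["44207509", "44207668", "44207958"], name="ITEM_NO"),
)

# Translate a (row label, column label) pair into integer positions
row_pos = inventory.index.get_loc("44207668")
col_pos = inventory.columns.get_loc("Seafrigo Count")

# Both accessors now address the same cell
print(row_pos, col_pos)
print(inventory.iloc[row_pos, col_pos] == inventory.loc["44207668", "Seafrigo Count"])
```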


### Indexing and column rearrangement

In [16]:
# Read the CSV and use 'ITEM_NO' as the index: election
election = pd.read_csv('sampledata.csv', index_col='ITEM_NO')

# Create a separate DataFrame with the columns ['ITEM_DESC', 'Logic Count', 'Seafrigo Count']: results
results = pd.DataFrame(election, columns=['ITEM_DESC', 'Logic Count', 'Seafrigo Count'])

# Print the output of results.head()
print(results.head())

                                ITEM_DESC  Logic Count  Seafrigo Count
ITEM_NO                                                               
44207509  DRAWER, DELTA VARIO ATLAS 120MM         1335            2235
44207668         TOILET PAPER, COTTENELLE          445             985
44207958         LINER, TRASH CAN, DOM SM          259             759
44208068                BLANKET, YC (G&G)          371             811
44207583  FOAMING SOAP, SQUARE M+G 5.4 oz          523             885


## Slicing DataFrames

### Slicing rows

In [17]:
# Slice the row labels '44207509' to '44208068': p_counties
p_counties = election.loc['44207509':'44208068']

# Print the p_counties DataFrame
print(p_counties)

# Slice the same row labels in reverse order: p_counties_rev
p_counties_rev = election.loc['44208068':'44207509':-1]

# Print the p_counties_rev DataFrame
print(p_counties_rev)

          AMC Item Number                         ITEM_DESC  Logic Count  \
ITEM_NO                                                                    
44207509       L044207509   DRAWER, DELTA VARIO ATLAS 120MM         1335   
44207668       L044207668          TOILET PAPER, COTTENELLE          445   
44207958       L044207958          LINER, TRASH CAN, DOM SM          259   
44208068       L044208068                 BLANKET, YC (G&G)          371   

          Seafrigo Count  Discrepancy (cases)  Discrepancy Value   \
ITEM_NO                                                             
44207509            2235                  900         $25,945.19    
44207668             985                  540         $17,813.36    
44207958             759                  500         $14,445.00    
44208068             811                  440         $94,467.70    

         ACTIVE/INACTIVE  
ITEM_NO                   
44207509           False  
44207668           False  
44207958           F

### Slicing columns

In [18]:
# Slice the columns from the starting column to 'Seafrigo Count': left_columns
left_columns = election.loc[:,:'Seafrigo Count']

# Print the output of left_columns.head()
print(left_columns.head(5))

# Slice the columns from 'Logic Count' to 'Seafrigo Count': middle_columns
middle_columns = election.loc[:,'Logic Count':'Seafrigo Count']

# Print the output of middle_columns.head()
print(middle_columns.head(5))

# Slice the columns from 'Logic Count' to the end: right_columns
right_columns = election.loc[:,'Logic Count':]

# Print the output of right_columns.head()
print(right_columns.head(5))

          AMC Item Number                         ITEM_DESC  Logic Count  \
ITEM_NO                                                                    
44207509       L044207509   DRAWER, DELTA VARIO ATLAS 120MM         1335   
44207668       L044207668          TOILET PAPER, COTTENELLE          445   
44207958       L044207958          LINER, TRASH CAN, DOM SM          259   
44208068       L044208068                 BLANKET, YC (G&G)          371   
44207583       L044207583   FOAMING SOAP, SQUARE M+G 5.4 oz          523   

          Seafrigo Count  
ITEM_NO                   
44207509            2235  
44207668             985  
44207958             759  
44208068             811  
44207583             885  
          Logic Count  Seafrigo Count
ITEM_NO                              
44207509         1335            2235
44207668          445             985
44207958          259             759
44208068          371             811
44207583          523             885
          Lo

### Subselecting DataFrames with lists

You can use lists to select specific row and column labels with the .loc[] accessor.

In [19]:
# Create the list of row labels: rows
rows = ['44207583', '44207668', '44207958']

# Create the list of column labels: cols
cols = ['Seafrigo Count', 'Logic Count', 'ACTIVE/INACTIVE']

# Create the new DataFrame: three_counties
three_counties = election.loc[rows,cols]

# Print the three_counties DataFrame
print(three_counties)

          Seafrigo Count  Logic Count ACTIVE/INACTIVE
ITEM_NO                                              
44207583             885          523           False
44207668             985          445           False
44207958             759          259           False


## Filtering DataFrames

### Thresholding data

In [22]:
# Create the boolean array: high_turnout
high_turnout = election[' Discrepancy Value '] > 70

# Filter the election DataFrame with the high_turnout array: high_turnout_df
high_turnout_df = election[high_turnout]

# Print the high_turnout_df DataFrame
print(high_turnout_df)

              AMC Item Number                                 ITEM_DESC  \
ITEM_NO                                                                   
44207509           L044207509           DRAWER, DELTA VARIO ATLAS 120MM   
44207668           L044207668                  TOILET PAPER, COTTENELLE   
44207958           L044207958                  LINER, TRASH CAN, DOM SM   
44208068           L044208068                         BLANKET, YC (G&G)   
44207583           L044207583           FOAMING SOAP, SQUARE M+G 5.4 oz   
44207984           L044207984                             STICK, TRUVIA   
44206683           L044206683            TOWEL, CLEANING WIPE (CS ONLY)   
44206359           L044206359                ATLAS FULL MEALTRAY, WHITE   
44207770           L044207770                          TRAY, ATLAS FULL   
44207015           L044207015        SWEETENER, PINK, STIX w/DELTA LOGO   
44206821           L044206821           BAG, BLACK SML PLSTC 30x37 1MIL   
44207643           L04420

In [21]:
election.columns

Index([u' AMC Item Number ', u'ITEM_DESC', u'Logic Count', u'Seafrigo Count',
       u'Discrepancy (cases)', u' Discrepancy Value ', u'ACTIVE/INACTIVE'],
      dtype='object')
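
The threshold above compares ' Discrepancy Value ' with a number, but `election.info()` reports that column as `object`, holding strings such as "$25,945.19". A numeric comparison is only meaningful after the strings are converted. The following is a minimal sketch of one way to do that, assuming the values always use a leading `$` and `,` as the thousands separator; the `values` Series is a hypothetical sample, not the real column.

```python
import pandas as pd

# Hypothetical sample mirroring the ' Discrepancy Value ' strings shown earlier
values = pd.Series(["$25,945.19", "$17,813.36", "$14,445.00"], name="Discrepancy Value")

# Strip the currency symbol and thousands separators, then convert to float
cleaned = values.str.replace(r"[$,]", "", regex=True).str.strip()
numeric = pd.to_numeric(cleaned)

# A numeric threshold now compares amounts rather than strings
above_20k = numeric > 20000
print(above_20k.tolist())
```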

### Filtering columns using other columns

In [23]:
# Import numpy
import numpy as np

# Create the boolean array: too_close
too_close = election[' Discrepancy Value '] < 10

# Assign np.nan to the 'ACTIVE/INACTIVE' column where too_close is True
election['ACTIVE/INACTIVE'][too_close] = np.nan

# Print the output of election.info()
print(election.info())

<class 'pandas.core.frame.DataFrame'>
Index: 1344 entries, 44207509 to 44206947
Data columns (total 7 columns):
 AMC Item Number       1344 non-null object
ITEM_DESC              1344 non-null object
Logic Count            1344 non-null int64
Seafrigo Count         1344 non-null int64
Discrepancy (cases)    1344 non-null int64
 Discrepancy Value     1344 non-null object
ACTIVE/INACTIVE        1312 non-null object
dtypes: int64(3), object(4)
memory usage: 124.0+ KB
None


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
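
The warning above is triggered by chained indexing (`election['ACTIVE/INACTIVE'][too_close] = ...`), where pandas cannot tell whether the first selection returned a view or a copy. A common way to avoid it is a single `.loc` assignment that names the rows and the column together. The sketch below uses a hypothetical three-row `frame`; the mask on 'Discrepancy (cases)' is an assumption chosen only to have a numeric column to compare.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the inventory data
frame = pd.DataFrame(
    {"Discrepancy (cases)": [900, 5, 500], "ACTIVE/INACTIVE": [False, False, True]},
    index=["44207509", "44207668", "44207958"],
)

# Boolean mask selecting the rows to blank out
small = frame["Discrepancy (cases)"] < 10

# Cast to object first so the column can hold both booleans and NaN
frame["ACTIVE/INACTIVE"] = frame["ACTIVE/INACTIVE"].astype(object)

# One .loc call selects rows and column together and assigns on the original frame
frame.loc[small, "ACTIVE/INACTIVE"] = np.nan
print(frame["ACTIVE/INACTIVE"].isna().tolist())
```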
  


### Filtering using NaNs

In certain scenarios, it may be necessary to remove rows and columns with missing data from a DataFrame. The .dropna() method performs this action.

The .shape attribute, which returns the number of rows and columns of a DataFrame as a tuple (or a one-element tuple with the number of rows of a Series), shows the effect of dropping missing values.

Finally, the thresh= keyword argument sets the minimum number of non-missing values a row or column must have in order to be kept. With thresh=1000 and axis='columns', columns with fewer than 1000 non-missing values are dropped.

In [25]:
# Select the 'Logic Count' and 'Seafrigo Count' columns: df
df = election[['Logic Count','Seafrigo Count']]

# Print the shape of df
print(df.shape)

# Drop rows in df with how='any' and print the shape
print(df.dropna(how='any').shape)

# Drop rows in df with how='all' and print the shape
print(df.dropna(how='all').shape)

# Call .dropna() with thresh=1000 and axis='columns' and print the output of .info() from election
print(election.dropna(thresh=1000, axis='columns').info())

(1344, 2)
(1344, 2)
(1344, 2)
<class 'pandas.core.frame.DataFrame'>
Index: 1344 entries, 44207509 to 44206947
Data columns (total 7 columns):
 AMC Item Number       1344 non-null object
ITEM_DESC              1344 non-null object
Logic Count            1344 non-null int64
Seafrigo Count         1344 non-null int64
Discrepancy (cases)    1344 non-null int64
 Discrepancy Value     1344 non-null object
ACTIVE/INACTIVE        1312 non-null object
dtypes: int64(3), object(4)
memory usage: 124.0+ KB
None
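
Because the two count columns above contain no missing values, all three shapes are identical. To see the options actually change the result, here is a small sketch on a hypothetical frame (`data` and its values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values
data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 6.0, 7.0],
    "c": [np.nan, np.nan, 8.0, np.nan],
})

any_shape = data.dropna(how="any").shape   # keep rows with no missing values
all_shape = data.dropna(how="all").shape   # drop only rows that are entirely missing

# Keep only columns holding at least 2 non-missing values
kept_columns = data.dropna(thresh=2, axis="columns").columns.tolist()

print(any_shape, all_shape, kept_columns)
```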


## Transforming DataFrames

### Using apply() to transform a column

The .apply() method applies an arbitrary Python function to a pandas object: on a DataFrame the function receives each column (or each row with axis='columns'), and on a Series it receives each element.
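
As a rough illustration, the sketch below applies a conversion function to a hypothetical `weather` frame (the column names and the `to_celsius` helper are assumptions, not part of the datasets above).

```python
import pandas as pd

# Hypothetical temperatures in Fahrenheit
weather = pd.DataFrame({"min_f": [32.0, 50.0], "max_f": [68.0, 86.0]})

def to_celsius(fahrenheit):
    """Convert Fahrenheit to Celsius; works on a scalar or a whole Series."""
    return (fahrenheit - 32) * 5 / 9

# DataFrame.apply passes each column (a Series) to the function
celsius = weather.apply(to_celsius)

# Series.apply calls the function once per element
labels = weather["max_f"].apply(lambda f: "warm" if f > 70 else "mild")

print(celsius)
print(labels.tolist())
```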

### Using .map() with a dictionary

The .map() method transforms the values of a Series according to a Python dictionary look-up.
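
A minimal sketch of a dictionary look-up follows. The maker codes resemble those in the auto data, but the meanings in `code_to_region` are invented for illustration.

```python
import pandas as pd

# Maker codes; the region names below are hypothetical
makers = pd.Series(["o", "x", "s", "o"])
code_to_region = {"o": "domestic", "x": "asian", "s": "european"}

# Each value is replaced by its dictionary entry
regions = makers.map(code_to_region)
print(regions.tolist())
```

Values that are missing from the dictionary become NaN in the result.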

### Using vectorized functions
When performance is paramount, you should avoid using .apply() and .map() because those constructs perform Python for-loops over the data stored in a pandas Series or DataFrame. By using vectorized functions instead, you can loop over the data at the same speed as compiled code (C, Fortran, etc.)! NumPy, SciPy and pandas come with a variety of vectorized functions (called Universal Functions or UFuncs in NumPy).

You can even write your own vectorized functions.
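
The sketch below shows the three flavors side by side on a few weights taken from the auto table: plain arithmetic, a NumPy universal function, and a small custom function assembled from vectorized pieces. The `weight_class` helper and its 3000 lb cutoff are assumptions for illustration.

```python
import numpy as np
import pandas as pd

weights_lb = pd.Series([3139, 4732, 1800])

# Arithmetic on a Series is vectorized: no explicit Python loop
weights_kg = weights_lb * 0.45359237

# NumPy universal functions operate on the whole Series at once
log_weights = np.log10(weights_lb)

# Hypothetical custom function built from vectorized pieces
def weight_class(w):
    return np.where(w > 3000, "heavy", "light")

print(weight_class(weights_lb).tolist())
```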

## Advanced indexing

### Index objects and labeled data
##### pandas Data Structures
- Key building blocks
 - Indexes: Sequence of labels
 - Series: 1D array with Index
 - DataFrames: 2D array with Series as columns 
- Indexes
 - Immutable (Like dictionary keys) 
 - Homogeneous in data type (Like NumPy arrays)
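
Immutability can be checked directly: assigning to a single index entry raises an error, while replacing the whole index is allowed. A small sketch on a hypothetical `prices` Series:

```python
import pandas as pd

prices = pd.Series([10.5, 11.2, 9.8], index=["Mon", "Tue", "Wed"])  # hypothetical data

# Individual labels cannot be changed in place
try:
    prices.index[0] = "Monday"
except TypeError as err:
    print("Index is immutable:", err)

# Replacing the whole index at once is allowed
prices.index = ["Monday", "Tuesday", "Wednesday"]
print(prices.index.tolist())
```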

### Changing index of a DataFrame

In [19]:
import pandas as pd

auto = pd.read_csv('auto.csv', index_col='origin ')

auto.head()

origin,mpg,cyl,displ,hp,weight,accel,yr,name,color,size,maker
US,18.0,6,250,88,3139,14.5,71,ford mustang,red,27.370336,o
US,9.0,8,304,193,4732,18.5,70,hi 1200d,green,62.199511,o
Asia,36.1,4,91,60,1800,16.4,78,honda civic cvcc,blue,9.0,x
US,18.5,6,250,98,3525,19.0,77,ford granada,red,34.515625,o
Europe,34.3,4,97,78,2188,15.8,80,audi 4000,blue,13.298178,s


In [20]:
# Create the list of new index labels: new_idx
new_idx = [x.lower() for x in auto.index]

# Assign new_idx to auto.index
auto.index = new_idx

# Print the auto DataFrame
print(auto.head())

         mpg  cyl  displ   hp  weight  accel  yr              name  color  \
us      18.0    6    250   88    3139   14.5  71      ford mustang    red   
us       9.0    8    304  193    4732   18.5  70          hi 1200d  green   
asia    36.1    4     91   60    1800   16.4  78  honda civic cvcc   blue   
us      18.5    6    250   98    3525   19.0  77      ford granada    red   
europe  34.3    4     97   78    2188   15.8  80         audi 4000   blue   

             size maker  
us      27.370336     o  
us      62.199511     o  
asia     9.000000     x  
us      34.515625     o  
europe  13.298178     s  


### Changing index name labels

In [24]:
# Assign the string 'CONTINENTS' to auto.index.name
auto.index.name = 'CONTINENTS'

# Print the auto DataFrame
print(auto.head())

# Assign the string 'FEATURES' to auto.columns.name
auto.columns.name = 'FEATURES'

# Print the auto DataFrame again
print(auto.head())

             mpg  cyl  displ   hp  weight  accel  yr              name  color  \
CONTINENTS                                                                      
us          18.0    6    250   88    3139   14.5  71      ford mustang    red   
us           9.0    8    304  193    4732   18.5  70          hi 1200d  green   
asia        36.1    4     91   60    1800   16.4  78  honda civic cvcc   blue   
us          18.5    6    250   98    3525   19.0  77      ford granada    red   
europe      34.3    4     97   78    2188   15.8  80         audi 4000   blue   

                 size maker  
CONTINENTS                   
us          27.370336     o  
us          62.199511     o  
asia         9.000000     x  
us          34.515625     o  
europe      13.298178     s  
FEATURES     mpg  cyl  displ   hp  weight  accel  yr              name  color  \
CONTINENTS                                                                      
us          18.0    6    250   88    3139   14.5  71      fo

### Building an index, then a DataFrame

You can also build the DataFrame and index independently, and then put them together. If you take this route, be careful, as any mistake in generating the DataFrame or the index can cause the data and the index to be aligned incorrectly.

In [26]:
# Generate the list of months (six month labels repeated five times): months
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'] * 5

# Assign months to auto.index
auto.index = months

# Print the modified auto DataFrame
print(auto.head())

FEATURES   mpg  cyl  displ   hp  weight  accel  yr              name  color  \
Jan       18.0    6    250   88    3139   14.5  71      ford mustang    red   
Feb        9.0    8    304  193    4732   18.5  70          hi 1200d  green   
Mar       36.1    4     91   60    1800   16.4  78  honda civic cvcc   blue   
Apr       18.5    6    250   98    3525   19.0  77      ford granada    red   
May       34.3    4     97   78    2188   15.8  80         audi 4000   blue   

FEATURES       size maker  
Jan       27.370336     o  
Feb       62.199511     o  
Mar        9.000000     x  
Apr       34.515625     o  
May       13.298178     s  


## Hierarchical indexing

### Extracting data with a MultiIndex

Extracting elements from the outermost level of a MultiIndex is just like in the case of a single-level Index. You can use the .loc[] accessor with an outer-level label.
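
A short sketch of outer-level lookups follows, using a hypothetical `sales` table indexed by state and month (the names and numbers are assumptions for illustration).

```python
import pandas as pd

# Hypothetical sales table indexed by (state, month)
sales = pd.DataFrame(
    {"eggs": [47, 110, 221, 77], "salt": [12.0, 50.0, 89.0, 87.0]},
    index=pd.MultiIndex.from_tuples(
        [("CA", 1), ("CA", 2), ("NY", 1), ("NY", 2)], names=["state", "month"]
    ),
)

ny_rows = sales.loc["NY"]                 # all rows whose outer label is 'NY'
both = sales.loc[["CA", "NY"]]            # a list of outer labels
ny_jan_salt = sales.loc[("NY", 1), "salt"]  # a full key picks out a single row

print(ny_rows)
print(ny_jan_salt)
```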

### Setting & sorting a MultiIndex

With a MultiIndex, you should always ensure the index is sorted. You can skip this only if you know the data is already sorted on the index fields.
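
One way to set and sort a MultiIndex is `.set_index()` followed by `.sort_index()`. The sketch below uses a hypothetical `flat` table whose rows are deliberately out of order.

```python
import pandas as pd

# Hypothetical flat table; rows are deliberately out of order
flat = pd.DataFrame({
    "state": ["NY", "CA", "NY", "CA"],
    "month": [2, 1, 1, 2],
    "eggs": [77, 47, 221, 110],
})

indexed = flat.set_index(["state", "month"])
print(indexed.index.is_monotonic_increasing)   # False: not sorted yet

indexed = indexed.sort_index()
print(indexed.index.is_monotonic_increasing)   # True after sorting

# Label slicing on the outer level works on the sorted index
print(indexed.loc["CA":"NY"])
```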

### Using .loc[] with nonunique indexes

It is always preferable to have a meaningful index that uniquely identifies each row. Even though pandas does not require unique index values in DataFrames, it works better if the index values are indeed unique.
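
With a repeated label, `.loc[]` returns every matching row rather than a single value. A small sketch on a hypothetical `cars` frame, where 'us' appears twice as it does in the auto data:

```python
import pandas as pd

# Hypothetical frame with a repeated index label
cars = pd.DataFrame({"mpg": [18.0, 9.0, 36.1]}, index=["us", "us", "asia"])

print(cars.index.is_unique)   # False

# A repeated label yields a Series with one entry per matching row
us_mpg = cars.loc["us", "mpg"]
print(us_mpg)
```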

### Indexing multiple levels of a MultiIndex

Looking up indexed data is fast and efficient. And you have already seen that lookups based on the outermost level of a MultiIndex work just like lookups on DataFrames that have a single-level Index.

Looking up data based on inner levels of a MultiIndex can be a bit trickier.

The trickiest of these lookups is accessing some inner levels of the index. In this case, you need to use slice(None) in the slicing parameter for the outermost dimension(s) instead of the usual :, or use pd.IndexSlice. For example:

stocks.loc[(slice(None), slice('2016-10-03', '2016-10-04')), :]

Pay particular attention to the tuple (slice(None), slice('2016-10-03', '2016-10-04')).
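
The lookup can be tried end to end on a small table. The `stocks` frame below is hypothetical (symbols, dates stored as strings, and prices are all assumptions); the sketch also shows the equivalent `pd.IndexSlice` form.

```python
import pandas as pd

# Hypothetical stocks table indexed by (symbol, date), sorted on both levels
stocks = pd.DataFrame(
    {"close": [112.5, 113.0, 113.1, 772.6, 776.4, 776.5]},
    index=pd.MultiIndex.from_product(
        [["AAPL", "GOOG"], ["2016-10-03", "2016-10-04", "2016-10-05"]],
        names=["symbol", "date"],
    ),
).sort_index()

# slice(None) stands in for ':' on the outer level
by_tuple = stocks.loc[(slice(None), slice("2016-10-03", "2016-10-04")), :]

# pd.IndexSlice allows the familiar colon syntax
idx = pd.IndexSlice
by_index_slice = stocks.loc[idx[:, "2016-10-03":"2016-10-04"], :]

print(by_tuple)
print(by_tuple.equals(by_index_slice))
```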

## Pivoting DataFrames

### Pivoting a single variable

### Pivoting all variables

If you do not select any particular variables, all of them will be pivoted. In this case - with the users DataFrame - both 'visitors' and 'signups' will be pivoted, creating hierarchical column labels.
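
Both cases can be sketched with a small long-format table. The `users` frame below is hypothetical; only its column names ('visitors', 'signups') come from the text above.

```python
import pandas as pd

# Hypothetical users table in long format
users = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "visitors": [139, 237, 326, 456],
    "signups": [7, 12, 3, 5],
})

# Pivot a single variable: one column per city holding 'visitors'
visitors_pivot = users.pivot(index="weekday", columns="city", values="visitors")

# Omit values= to pivot every remaining column; the columns become hierarchical
full_pivot = users.pivot(index="weekday", columns="city")

print(visitors_pivot)
print(full_pivot.columns.nlevels)
```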

### Stacking & unstacking DataFrames

In [42]:
# Unstack auto: byweekday
byweekday = auto.unstack('maker')

# Print the first rows of byweekday
print(byweekday.head())

# Stack auto by the 'FEATURES' column level and print it
print(auto.stack(level='FEATURES').head())

FEATURES     
mpg       Jan      18
          Feb       9
          Mar    36.1
          Apr    18.5
          May    34.3
dtype: object
     FEATURES
Jan  mpg           18
     cyl            6
     displ        250
     hp            88
     weight      3139
dtype: object


### Restoring the index order

You will now use .swaplevel(0, 1) to flip the index levels. Note they won't be sorted. To sort them, you will have to follow up with a .sort_index(). You will then obtain the original DataFrame. Note that an unsorted index leads to slicing failures.

In [49]:
# Stack the 'FEATURES' column level into the index of auto: newusers
newusers = auto.stack(level='FEATURES')

# Swap the levels of the index of newusers: newusers
newusers = newusers.swaplevel(0,1)

# Print newusers and verify that the index is not sorted
print(newusers)

# Sort the index of newusers: newusers
newusers = newusers.sort_index()

# Print newusers and verify that the index is now sorted
print(newusers.head())

# Verify that the new DataFrame is equal to the original
print(newusers.equals(auto))

FEATURES     
mpg       Jan                       18
cyl       Jan                        6
displ     Jan                      250
hp        Jan                       88
weight    Jan                     3139
accel     Jan                     14.5
yr        Jan                       71
name      Jan             ford mustang
color     Jan                      red
size      Jan                  27.3703
maker     Jan                        o
mpg       Feb                        9
cyl       Feb                        8
displ     Feb                      304
hp        Feb                      193
weight    Feb                     4732
accel     Feb                     18.5
yr        Feb                       70
name      Feb                 hi 1200d
color     Feb                    green
size      Feb                  62.1995
maker     Feb                        o
mpg       Mar                     36.1
cyl       Mar                        4
displ     Mar                       91
hp        M

## Melting DataFrames

### Adding names for readability

In [54]:
# Reset the index: visitors_by_city_weekday
visitors_by_city_weekday = auto.reset_index() 

# Print visitors_by_city_weekday
print(visitors_by_city_weekday.head())

# Melt visitors_by_city_weekday: visitors
visitors = pd.melt(visitors_by_city_weekday, id_vars=['mpg'], value_name='yr')

# Print visitors
print(visitors.head())

FEATURES index   mpg  cyl  displ   hp  weight  accel  yr              name  \
0          Jan  18.0    6    250   88    3139   14.5  71      ford mustang   
1          Feb   9.0    8    304  193    4732   18.5  70          hi 1200d   
2          Mar  36.1    4     91   60    1800   16.4  78  honda civic cvcc   
3          Apr  18.5    6    250   98    3525   19.0  77      ford granada   
4          May  34.3    4     97   78    2188   15.8  80         audi 4000   

FEATURES  color       size maker  
0           red  27.370336     o  
1         green  62.199511     o  
2          blue   9.000000     x  
3           red  34.515625     o  
4          blue  13.298178     s  
    mpg FEATURES   yr
0  18.0    index  Jan
1   9.0    index  Feb
2  36.1    index  Mar
3  18.5    index  Apr
4  34.3    index  May


### Going from wide to long

You can move multiple columns into a single column (making the data long and skinny) by "melting" them: id_vars names the columns to keep as identifiers and value_vars names the columns to melt.

In [55]:
# Melt auto: skinny
skinny = pd.melt(auto, id_vars=['name', 'maker'], value_vars=['yr','hp'])

# Print skinny
print(skinny)

                                name maker FEATURES  value
0                       ford mustang     o       yr     71
1                           hi 1200d     o       yr     70
2                   honda civic cvcc     x       yr     78
3                       ford granada     o       yr     77
4                          audi 4000     s       yr     80
5                       datsun 200sx     x       yr     81
6                     toyota corolla     x       yr     80
7                volkswagen 411 (sw)     s       yr     72
8            mercury cougar brougham     o       yr     77
9                        ford torino     o       yr     70
10                         vw pickup     s       yr     82
11             pontiac sunbird coupe     o       yr     77
12                     dodge rampage     o       yr     82
13                          ford ltd     o       yr     75
14             chevrolet monte carlo     o       yr     70
15  chevrolet chevelle concours (sw)     o       yr     

### Obtaining key-value pairs with melt()

Sometimes, all you need is some key-value pairs, and the context does not matter. If that context is in the index, pd.melt() discards it and returns only the variable-value pairs.

In [57]:
# Set the new index: users_idx
users_idx = auto.set_index(['maker','yr'])

# Print the users_idx DataFrame
print(users_idx.head())

# Obtain the key-value pairs: kv_pairs
kv_pairs = pd.melt(users_idx,col_level=0)

# Print the key-value pairs
print(kv_pairs.head())

FEATURES   mpg  cyl  displ   hp  weight  accel              name  color  \
maker yr                                                                  
o     71  18.0    6    250   88    3139   14.5      ford mustang    red   
      70   9.0    8    304  193    4732   18.5          hi 1200d  green   
x     78  36.1    4     91   60    1800   16.4  honda civic cvcc   blue   
o     77  18.5    6    250   98    3525   19.0      ford granada    red   
s     80  34.3    4     97   78    2188   15.8         audi 4000   blue   

FEATURES       size  
maker yr             
o     71  27.370336  
      70  62.199511  
x     78   9.000000  
o     77  34.515625  
s     80  13.298178  
  FEATURES value
0      mpg    18
1      mpg     9
2      mpg  36.1
3      mpg  18.5
4      mpg  34.3


## Pivot tables

### Setting up a pivot table

In [60]:
# Create the DataFrame with the appropriate pivot table: by_city_day
by_city_day = auto.pivot_table(index='name',columns='yr')

# Print by_city_day
print(by_city_day.head())

FEATURES                         accel                                        \
yr                                  70  71    72  73  74    75  77  78    80   
name                                                                           
audi 4000                          NaN NaN   NaN NaN NaN   NaN NaN NaN  15.8   
buick century                      NaN NaN   NaN NaN NaN  21.0 NaN NaN   NaN   
buick skyhawk                      NaN NaN   NaN NaN NaN  15.0 NaN NaN   NaN   
chevrolet chevelle concours (sw)   NaN NaN  14.0 NaN NaN   NaN NaN NaN   NaN   
chevrolet monte carlo              9.5 NaN   NaN NaN NaN   NaN NaN NaN   NaN   

FEATURES                             ... weight                              \
yr                                81 ...     71      72  73  74      75  77   
name                                 ...                                      
audi 4000                        NaN ...    NaN     NaN NaN NaN     NaN NaN   
buick century                    NaN ...   

### Using other aggregations in pivot tables

You can also use other aggregation functions within a pivot table by specifying the aggfunc parameter.

In [61]:
# Use a pivot table to display the count of each column by 'name': count_by_weekday1
count_by_weekday1 = auto.pivot_table(index='name',aggfunc='count')

# Print count_by_weekday1
print(count_by_weekday1.head())

# Repeat the count with aggfunc=len, this time indexed by 'yr': count_by_weekday2
count_by_weekday2 = auto.pivot_table(index='yr',aggfunc=len)

# Check whether the two results are equal (they are indexed differently here, so they are not)
print('==========================================')
print(count_by_weekday1.equals(count_by_weekday2))

FEATURES                          accel  color  cyl  displ  hp  maker  mpg  \
name                                                                         
audi 4000                             1      1    1      1   1      1    1   
buick century                         1      1    1      1   1      1    1   
buick skyhawk                         1      1    1      1   1      1    1   
chevrolet chevelle concours (sw)      1      1    1      1   1      1    1   
chevrolet monte carlo                 1      1    1      1   1      1    1   

FEATURES                          size  weight  yr  
name                                                
audi 4000                            1       1   1  
buick century                        1       1   1  
buick skyhawk                        1       1   1  
chevrolet chevelle concours (sw)     1       1   1  
chevrolet monte carlo                1       1   1  
False


### Using margins in pivot tables

Sometimes it's useful to add totals in the margins of a pivot table. You can do this with the argument margins=True. The cell below uses sum as the aggregation function, first without and then with margins.

In [62]:
# Create the DataFrame with the appropriate pivot table: signups_and_visitors
signups_and_visitors = auto.pivot_table(index='name',aggfunc=sum)

# Print signups_and_visitors
print(signups_and_visitors.head())

# Add in the margins: signups_and_visitors_total 
signups_and_visitors_total = auto.pivot_table(index='yr',margins=True,aggfunc=sum)

# Print signups_and_visitors_total
print(signups_and_visitors_total.head())

FEATURES                          accel  cyl  displ   hp   mpg       size  \
name                                                                        
audi 4000                          15.8    4     97   78  34.3  13.298178   
buick century                      21.0    6    231  110  17.0  42.401803   
buick skyhawk                      15.0    6    231  110  21.0  25.654225   
chevrolet chevelle concours (sw)   14.0    8    307  130  13.0  46.648900   
chevrolet monte carlo               9.5    8    400  150  15.0  39.292003   

FEATURES                          weight  yr  
name                                          
audi 4000                           2188  80  
buick century                       3907  75  
buick skyhawk                       3039  75  
chevrolet chevelle concours (sw)    4098  72  
chevrolet monte carlo               3761  70  
FEATURES  accel   cyl   displ     hp   mpg        size   weight
yr                                                             
70 

## Grouping data

#### Advantages of categorical data types

- Computations are faster.
- Categorical data require less space in memory.
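
The memory saving is easy to observe on a column with few distinct values. The sketch below compares a hypothetical string column with its categorical version; actual savings depend on the data and the pandas version.

```python
import pandas as pd

# Hypothetical column with a few distinct values repeated many times
colors = pd.Series(["red", "green", "blue"] * 10000)
as_category = colors.astype("category")

# deep=True includes the memory used by the string values themselves
plain_bytes = colors.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(plain_bytes, category_bytes)
```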

### Grouping by multiple columns

Use .groupby() with one or more columns to analyze how the rows are distributed across groups.

In [65]:
# Group auto by 'name'
by_class = auto.groupby('name')

# Aggregate the 'yr' column of by_class by count
count_by_class = by_class['yr'].count()

# Print count_by_class
print(count_by_class.head())

# Group auto by 'name' and 'maker'
by_mult = auto.groupby(['name','maker'])

# Aggregate the 'hp' column of by_mult by count
count_mult = by_mult['hp'].count()

# Print count_mult
print(count_mult.head())

name
audi 4000                           1
buick century                       1
buick skyhawk                       1
chevrolet chevelle concours (sw)    1
chevrolet monte carlo               1
Name: yr, dtype: int64
name                              maker
audi 4000                         s        1
buick century                     o        1
buick skyhawk                     o        1
chevrolet chevelle concours (sw)  o        1
chevrolet monte carlo             o        1
Name: hp, dtype: int64


### Grouping by another series

In [67]:
# Group auto by the Series auto['name']: life_by_region
life_by_region = auto.groupby(auto['name'])

# Print the mean of the 'hp' column for each group
print(life_by_region['hp'].mean())

name
audi 4000                            78
buick century                       110
buick skyhawk                       110
chevrolet chevelle concours (sw)    130
chevrolet monte carlo               150
chevy s-10                           82
datsun 200sx                        100
dodge challenger se                 170
dodge rampage                        84
fiat 124 sport coupe                 90
ford gran torino                    140
ford granada                         98
ford ltd                            148
ford mustang                         88
ford torino                         140
hi 1200d                            193
honda civic cvcc                     60
mazda 626                            75
mazda rx-4                          110
mercury cougar brougham             130
plymouth valiant custom              95
pontiac astro                        78
pontiac sunbird coupe                88
renault 5 gtl                        58
toyota celica gt                   

### Groupby and aggregation

#### Aggregation functions
- String names
 - 'sum'
 - 'mean'
 - 'count'
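
These string names can be passed to `.agg()` either one at a time or as a list. A minimal sketch on a hypothetical four-row subset of the auto table:

```python
import pandas as pd

# Hypothetical subset of the auto table
cars = pd.DataFrame({
    "maker": ["o", "o", "x", "s"],
    "hp": [88, 193, 60, 78],
})

by_maker = cars.groupby("maker")["hp"]

total_hp = by_maker.agg("sum")            # a single aggregation by name
summary = by_maker.agg(["mean", "count"]) # several aggregations at once

print(total_hp)
print(summary)
```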

### Computing multiple aggregates of multiple columns

The .agg() method can be used with a tuple or list of aggregations as input. When applying multiple aggregations on multiple columns, the aggregated DataFrame has a multi-level column index.

In [71]:
# Group auto by 'name': by_class
by_class = auto.groupby('name')

# Select 'yr' and 'hp'
by_class_sub = by_class[['yr','hp']]

# Aggregate by_class_sub by 'max' and 'median': aggregated
aggregated = by_class_sub.agg(['max', 'median'])

# Print the maximum 'yr' for each name
print(aggregated.loc[:, ('yr','max')])

# Print the median 'hp' for each name
print(aggregated.loc[:, ('hp','median')])

name
audi 4000                           80
buick century                       75
buick skyhawk                       75
chevrolet chevelle concours (sw)    72
chevrolet monte carlo               70
chevy s-10                          82
datsun 200sx                        81
dodge challenger se                 70
dodge rampage                       82
fiat 124 sport coupe                73
ford gran torino                    74
ford granada                        77
ford ltd                            75
ford mustang                        71
ford torino                         70
hi 1200d                            70
honda civic cvcc                    78
mazda 626                           80
mazda rx-4                          77
mercury cougar brougham             77
plymouth valiant custom             75
pontiac astro                       75
pontiac sunbird coupe               77
renault 5 gtl                       77
toyota celica gt                    82
toyota corolla      

### Aggregating on index levels/fields

If you have a DataFrame with a multi-level row index, the individual levels can be used to perform the groupby. This allows advanced aggregation techniques to be applied along one or more levels in the index and across one or more columns.
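
A small sketch of grouping on index levels, assuming a hypothetical `sales` DataFrame with a two-level row index named `region` and `year` (none of these names come from the notebook's data):

```python
import pandas as pd

# Hypothetical data with a two-level row index (region, year).
index = pd.MultiIndex.from_tuples(
    [('east', 2019), ('east', 2020), ('west', 2019), ('west', 2020)],
    names=['region', 'year'],
)
sales = pd.DataFrame(
    {'units': [10, 12, 7, 9], 'returns': [1, 2, 0, 3]},
    index=index,
)

# Group on a single index level by name.
by_region = sales.groupby(level='region').sum()

# Group on several levels at once and apply more than one aggregation.
by_both = sales.groupby(level=['region', 'year']).agg(['sum', 'max'])

print(by_region)
```

The `level` argument accepts level names or integer positions, so `level=0` would be another way to write the first grouping.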

### Grouping on a function of the index

Groupby operations can also be performed on transformations of the index values. In the case of a DatetimeIndex, we can extract portions of the datetime over which to group.
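
One possible illustration, assuming a hypothetical `sales` table indexed by date: the weekday name is derived from the index and used as the grouping key.

```python
import pandas as pd

# Hypothetical daily sales indexed by date (two Mondays, two Tuesdays).
dates = pd.to_datetime(['2015-02-02', '2015-02-03', '2015-02-09', '2015-02-10'])
sales = pd.DataFrame({'units': [3, 8, 5, 4]}, index=dates)

# Derive the weekday name from each index entry and group on it.
by_day = sales.groupby(sales.index.day_name())
units_by_day = by_day['units'].sum()

print(units_by_day)
```

Other datetime attributes such as `sales.index.month` or `sales.index.year` can be passed to `groupby()` in the same way.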

## Groupby and transformation

### Detecting outliers with Z-Scores

After grouping, you can call .transform() with a z-score function to standardize each group independently. The z-score is also useful for finding outliers: an observation whose z-score lies beyond +/- 3 is generally considered an outlier.
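
The sketch below shows one way this could look. To stay dependency-free it defines its own `zscore` helper as a stand-in for `scipy.stats.zscore` (both use the population standard deviation, `ddof=0`); the data and column names are made up.

```python
import pandas as pd

def zscore(series):
    # Population z-score (ddof=0), the default convention of scipy.stats.zscore.
    return (series - series.mean()) / series.std(ddof=0)

# Hypothetical data: one value in group 'a' is far from the rest.
df = pd.DataFrame({
    'group': ['a'] * 12 + ['b'] * 3,
    'value': [10.0] * 11 + [100.0] + [1.0, 2.0, 3.0],
})

# .transform() returns a result aligned row-for-row with the original data.
df['z'] = df.groupby('group')['value'].transform(zscore)

# Flag rows whose z-score lies beyond +/- 3 within their own group.
outliers = df[df['z'].abs() > 3]

print(outliers)
```

Because the standardization happens per group, a value is judged against its own group's mean and spread rather than against the whole column.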

### Filling missing data (imputation) by group

Many statistical and machine learning packages cannot determine the best action to take when missing data entries are encountered. pandas handles missing data naturally, both through its default behavior and by letting you define custom behavior. In Chapter 1, you practiced using the .dropna() method to drop missing values. Now, you will practice imputing missing values. You can use .groupby() and .transform() to fill missing data appropriately for each group.
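
A minimal sketch of group-wise imputation, assuming a hypothetical `passengers` table with `pclass` and `age` columns; the helper name `impute_median` is also an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; two ages are missing.
passengers = pd.DataFrame({
    'pclass': [1, 1, 1, 3, 3, 3],
    'age': [40.0, np.nan, 50.0, 20.0, 30.0, np.nan],
})

def impute_median(series):
    # Replace missing entries with the median of the same group.
    return series.fillna(series.median())

# Each missing age is filled with the median age of its own class.
passengers['age'] = passengers.groupby('pclass')['age'].transform(impute_median)

print(passengers)
```

The median is only one possible choice; the mean or a constant could be substituted inside the helper.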

### Other transformations with .apply

When used on a groupby object, the .apply() method runs an arbitrary function on each of the groups. These functions can be aggregations, transformations or more complex workflows. The .apply() method then combines the per-group results into a single Series or DataFrame, depending on what the function returns.
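
As a rough example, the custom function below returns one number per group, so the combined result is a Series indexed by the group key. The `cars` table and the `hp_range` helper are hypothetical.

```python
import pandas as pd

# Hypothetical data; 'origin' and 'hp' are illustrative column names.
cars = pd.DataFrame({
    'origin': ['US', 'US', 'Europe', 'Europe', 'Asia'],
    'hp': [130, 95, 58, 80, 110],
})

def hp_range(hp):
    # Arbitrary per-group computation: spread between largest and smallest value.
    return hp.max() - hp.min()

spread = cars.groupby('origin')['hp'].apply(hp_range)

print(spread)
```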

## Groupby and filtering

### Grouping and filtering with .apply()

By using .apply(), you can write functions that filter rows within groups. The .apply() method will handle the iteration over individual groups and then re-combine them back into a Series or DataFrame.

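def c_deck_survival(gr):
    # Boolean mask: cabin starts with 'C'; missing cabins count as False
    c_passengers = gr['cabin'].str.startswith('C').fillna(False)
    # Survival rate among the C-deck passengers of this group
    return gr.loc[c_passengers, 'survived'].mean()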
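
One way the function above might be called is sketched here on made-up rows. The grouping column `sex` and the toy values are assumptions, and this variant passes `na=False` to `str.startswith` instead of chaining `.fillna(False)`, which has the same effect on the mask.

```python
import numpy as np
import pandas as pd

def c_deck_survival(gr):
    # na=False treats passengers with a missing cabin as "not on C deck".
    c_passengers = gr['cabin'].str.startswith('C', na=False)
    return gr.loc[c_passengers, 'survived'].mean()

# Hypothetical rows with Titanic-style column names.
titanic = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male'],
    'cabin': ['C85', 'B20', 'C123', 'C52', np.nan],
    'survived': [1, 1, 0, 1, 0],
})

# Select only the columns the function needs, then apply it to each group.
c_surv_by_sex = titanic.groupby('sex')[['cabin', 'survived']].apply(c_deck_survival)

print(c_surv_by_sex)
```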

### Grouping and filtering with .filter()

You can use groupby with the .filter() method to remove whole groups of rows from a DataFrame based on a boolean condition.
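
A short sketch of group-level filtering, assuming a hypothetical `sales` table: every row of a company is kept or dropped depending on that company's total.

```python
import pandas as pd

# Hypothetical data; company names and numbers are made up.
sales = pd.DataFrame({
    'company': ['Acme', 'Acme', 'Hooli', 'Hooli', 'Initech'],
    'units': [18, 17, 3, 5, 40],
})

# Keep only the rows of companies whose total units exceed 30.
big_sellers = sales.groupby('company').filter(lambda g: g['units'].sum() > 30)

print(big_sellers)
```

The function passed to `.filter()` must return a single boolean per group; the result keeps the original row index and column layout.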

### Filtering and grouping with .map()

Sometimes, you may instead want to group by a function/transformation of a column. The key here is that the transformed Series is indexed the same way as the DataFrame, so pandas can align each row with its group label. You can also mix and match column grouping with Series grouping.
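
For instance, a boolean condition on a column can be mapped to readable labels and used as a grouping key. The `passengers` table, its columns, and the labels below are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical rows with Titanic-style column names.
passengers = pd.DataFrame({
    'age': [4, 35, 9, 62, 28],
    'sex': ['male', 'female', 'female', 'male', 'male'],
    'survived': [1, 1, 0, 0, 1],
})

# Boolean Series sharing the DataFrame's index, mapped to readable labels.
under10 = (passengers['age'] < 10).map({True: 'under 10', False: '10 or over'})

# Group by the mapped Series alone...
by_age = passengers.groupby(under10)['survived'].mean()

# ...or mix it with an ordinary column name.
by_age_sex = passengers.groupby([under10, 'sex'])['survived'].mean()

print(by_age)
print(by_age_sex)
```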

## Case Study:

#### Reminder: indexing & pivoting

- Filtering and indexing 
 - One-level indexing
 - Multi-level indexing
- Reshaping DataFrames with pivot() 
- pivot_table()

#### Reminder: groupby
- Useful DataFrame methods 
 - unique()
 - value_counts()
 - Aggregations, transformations, filtering
 
### Grouping and aggregating

USA_edition_grouped['Medal'].count()

### Using .value_counts() for ranking

Notice that .value_counts() sorts by values by default. The result is returned as a Series of counts indexed by unique entries from the original Series with values (counts) ranked in descending order.
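
A minimal sketch of that ranking, using made-up rows and placeholder country codes (`AAA`, `BBB`, `CCC`); the column name `NOC` is an assumption.

```python
import pandas as pd

# Hypothetical medal rows; the country codes are placeholders, not real data.
medals = pd.DataFrame({
    'NOC': ['AAA', 'BBB', 'AAA', 'CCC', 'AAA', 'BBB'],
    'Medal': ['Gold', 'Gold', 'Silver', 'Bronze', 'Bronze', 'Silver'],
})

# Count rows per country; the result is sorted from most to fewest.
medal_counts = medals['NOC'].value_counts()

print(medal_counts)
```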

### Using .pivot_table() to count medals by type

Rather than ranking countries by total medals won and showing that list, you may want to see a bit more detail. You can use a pivot table to compute how many separate bronze, silver and gold medals each country won. That pivot table can then be used to repeat the previous computation to rank by total medals won.
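
The idea can be sketched as follows, again on made-up rows with placeholder country codes. The column names `Athlete` and `NOC` and the added `totals` column are assumptions.

```python
import pandas as pd

# Hypothetical medal rows; the codes and athlete ids are placeholders.
medals = pd.DataFrame({
    'Athlete': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6'],
    'NOC': ['AAA', 'BBB', 'AAA', 'CCC', 'AAA', 'BBB'],
    'Medal': ['Gold', 'Gold', 'Silver', 'Bronze', 'Bronze', 'Silver'],
})

# One row per country, one column per medal type, counting athletes in each cell.
counted = medals.pivot_table(
    index='NOC', columns='Medal', values='Athlete',
    aggfunc='count', fill_value=0,
)

# Add a row-wise total and rank the countries by it.
counted['totals'] = counted.sum(axis='columns')
counted = counted.sort_values('totals', ascending=False)

print(counted)
```

`fill_value=0` replaces the missing country/medal combinations with zero instead of NaN.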

## Understanding the column labels

#### Reminder: slicing & filtering
- Indexing and slicing
 - .loc[] and .iloc[] accessors
- Filtering
 - Selecting by Boolean Series
 - Filtering null/non-null and zero/non-zero values

#### Reminder: Handling categorical data
- Useful DataFrame methods for handling categorical data: 
 - value_counts()
 - unique()
 - groupby()
- groupby() aggregations:
 - mean(), std(), count()
 
### Applying .drop_duplicates()

The duplicates can be dropped using the .drop_duplicates() method, leaving behind the unique observations. 
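
A brief sketch on a hypothetical two-column table (the column names `Event_gender` and `Gender` are assumptions): repeated combinations collapse to a single row each.

```python
import pandas as pd

# Hypothetical rows; several combinations are repeated.
events = pd.DataFrame({
    'Event_gender': ['M', 'M', 'W', 'W', 'W'],
    'Gender': ['Men', 'Men', 'Women', 'Women', 'Men'],
})

# Keep the first occurrence of every distinct row.
unique_pairs = events.drop_duplicates()

print(unique_pairs)
```

The surviving rows keep their original index labels, which makes it easy to trace an unusual combination back to where it first appeared.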

### Finding possible errors with .groupby()

You will now use .groupby() to continue your exploration.

### Locating suspicious data

You will now inspect the suspect record by locating the offending row.

## Constructing alternative country rankings

#### Two new DataFrame methods
- idxmax(): Row or column label where maximum value is located
- idxmin(): Row or column label where minimum value is located
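
As an illustration of both methods, the sketch below uses a small table of made-up counts with years as rows and placeholder country codes as columns.

```python
import pandas as pd

# Hypothetical counts; the values and country codes are not real data.
medals_by_edition = pd.DataFrame(
    {'AAA': [10, 25, 0], 'BBB': [30, 20, 40]},
    index=[1972, 1976, 1980],
)

# For each column, the row label where the maximum or minimum occurs.
best_year = medals_by_edition.idxmax()
worst_year = medals_by_edition.idxmin()

# For each row, the column label where the maximum occurs.
top_country = medals_by_edition.idxmax(axis='columns')

print(best_year)
print(top_country)
```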

### Using .nunique() to rank by distinct sports

