<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Aggregating And Combining DataFrames              
</p>
</div>

DS-NTL-010824
<p>Phase 1: Topic 5.2</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

## Objectives

- Use GroupBy objects to organize and aggregate data
- Create pivot tables from DataFrames
- Combine DataFrames by merging, joining, and concatenating

Suppose a categorical variable takes on a few discrete values.

Each of these values forms a group. We may want to:
- Calculate statistics on various quantities for each group (mean, etc.)
- Transform/scale certain columns differently for each group.


DataFrame.groupby() allows us to do this.

Take the Titanic dataset again:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

titanic_df = pd.read_csv('Data/titanic.csv')
titanic_df.head()

Sex as a relevant categorical variable. Compare by sex:
- survival rate
- distribution of ages
- fare

# groupby 

In [None]:
titanic_subset = titanic_df[['Sex', 'Survived', 'Age', 'Fare']]
titanic_subset.groupby('Sex')

A GroupBy object has many useful methods for processing data by group.
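
As a minimal sketch of what a GroupBy object holds, the cell below uses a small made-up DataFrame (`toy_df` and its values are hypothetical, not the Titanic data) and inspects the groups before any aggregation is applied.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic subset
toy_df = pd.DataFrame({
    'Sex': ['female', 'male', 'female'],
    'Age': [30, 40, 22],
})

grouped = toy_df.groupby('Sex')

# Number of distinct groups found in the 'Sex' column
n_groups = grouped.ngroups

# Number of rows in each group
sizes = grouped.size()

# Pull out the rows belonging to a single group as a regular DataFrame
females = grouped.get_group('female')
females
```

Nothing is computed when `groupby` is called; the work happens when a method such as `.size()` or `.mean()` is applied.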

#### Aggregation methods 

- Methods that compute statistics across the different groups.
- Common aggregation methods:
    - .min(): returns the minimum value for each column by group
    - .max(): returns the maximum value for each column by group
    - .mean(): returns the average value for each column by group
    - .median(): returns the median value for each column by group
    - .count(): returns the count of each column by group
    - .sum(): returns the sum of each column by group

Computing the mean of columns by group:
- Note: mean of Survived is the survival fraction.

In [None]:
titanic_subset.groupby('Sex').mean()

In [None]:
titanic_subset.groupby('Sex')['Fare'].mean()

Any obvious distinctions between groups here?

#### .agg(func) method
Can write your own aggregations.
- Get square root of the sum of squares of desired columns.

In [None]:
titanic_subset.groupby('Sex').agg(lambda x: np.sqrt(np.sum(x**2)))
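
`.agg()` can also apply a different function to each column. The sketch below uses pandas named aggregation on a made-up DataFrame; the data and the output column names (`survival_rate`, `max_fare`, `fare_range`) are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic subset
toy_df = pd.DataFrame({
    'Sex': ['female', 'male', 'female', 'male'],
    'Survived': [1, 0, 1, 1],
    'Fare': [80.0, 10.0, 20.0, 30.0],
})

# Named aggregation: output_column = (input_column, function)
summary = toy_df.groupby('Sex').agg(
    survival_rate=('Survived', 'mean'),
    max_fare=('Fare', 'max'),
    fare_range=('Fare', lambda x: x.max() - x.min()),  # custom aggregation
)
summary
```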

#### .transform(func) method
- This is not an aggregation.
- Transforms entries in each column differently according to their group.




Example: standardize columns for each sex separately:

- From each entry of a column, subtract the mean of that column for the passenger's sex.
- Then divide by the standard deviation of that column for that sex.

In [None]:
titanic_subset.groupby('Sex').transform(lambda col: (col - col.mean())/col.std())
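
A quick way to see that `.transform()` is not an aggregation is to check the shape and the per-group statistics of its output. The sketch below does this on a made-up DataFrame (`toy_df` and its values are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three passengers of each sex
toy_df = pd.DataFrame({
    'Sex': ['female', 'male', 'female', 'male', 'female', 'male'],
    'Fare': [80.0, 10.0, 20.0, 30.0, 50.0, 20.0],
})

# Standardize Fare within each sex; the result has one entry per input row
standardized = toy_df.groupby('Sex')['Fare'].transform(
    lambda col: (col - col.mean()) / col.std()
)

# After standardizing, each group should have mean 0 and standard deviation 1
check = standardized.groupby(toy_df['Sex']).agg(['mean', 'std'])
check
```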

#### Grouping by multiple categorical variables

- Split data into multiple levels of groups. 
- Group by sex (Male/Female) with subgroups in each according to passenger class.

df.groupby() takes a list of categorical columns to group on:

In [None]:
titanic_subset2 = titanic_df[['Sex', 'Pclass', 'Survived', 'Age', 'Fare']]
titanic_subset2.groupby(['Sex','Pclass'])

Calculate mean of attributes within these groups/subgroups:

In [None]:
grouped_df = titanic_subset2.groupby(['Sex','Pclass']).mean()
#grouped_df = titanic_subset2.groupby(by =['Sex','Pclass']).agg('mean')

grouped_df
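
Grouping on several columns returns a result indexed by every grouping column. As a rough sketch on made-up data (`toy_df` and its values are hypothetical), a single subgroup can be read with a tuple key, and `.unstack()` spreads one grouping level across the columns.

```python
import pandas as pd

# Hypothetical toy data: two passengers per (Sex, Pclass) combination
toy_df = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'female', 'male', 'male', 'male', 'male'],
    'Pclass': [1, 1, 3, 3, 1, 1, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 1, 0, 0],
})

# Survival fraction for each (Sex, Pclass) subgroup
rates = toy_df.groupby(['Sex', 'Pclass'])['Survived'].mean()

# Look up one subgroup with a tuple of group labels
female_first_class = rates.loc[('female', 1)]

# Move the Pclass level into the columns: rows are Sex, columns are Pclass
rates_table = rates.unstack('Pclass')
rates_table
```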

#### Basic Ideas of Data Shaping in Pandas
1. Wide vs. Long Formats


<div>
    <center><img src="Images/hw_wide.png" align = "center" width="400"/></center>
    <center>Wide format</center>
</div>
    

<div>
        <center><img src="Images/hw_long.png" align = "center" width="300"/></center>
    <center>Long format</center>
</div>

#### Pivoting

- Convert from a long to a wide format:

   - DataFrame.pivot(index, columns, values)

One attribute becomes the index; the values of another attribute become the labels of the new columns.

Best to see an example:

In [None]:
value_list = [182, 160, 130, 78, 67, 52]
# Build from a dict so 'value' stays numeric (a mixed NumPy array would cast everything to strings)
physical_data = pd.DataFrame({'name': ['John', 'Christopher', 'Melinda']*2,
                              'attribute': ['Height', 'Weight']*3,
                              'value': value_list})

physical_data


This is long form. Use pivot to convert to wide format.

In [None]:
wide_form = physical_data.pivot(index = 'name', columns = 'attribute', values = 'value')
wide_form

#### Melting: the inverse of pivoting.

- Take data from wide to long format.
- pd.melt(dataframe, id_vars, value_vars, var_name, value_name)

In [None]:
wide_form.reset_index(inplace = True)
wide_form

In [None]:
pd.melt(wide_form, 
        id_vars = ['name'], 
        value_vars = ['Height', 'Weight'])
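
The `var_name` and `value_name` arguments of `pd.melt` set the labels of the two new columns. The sketch below uses a small made-up wide table (`wide` and its values are hypothetical) and then pivots back to confirm the round trip.

```python
import pandas as pd

# Hypothetical wide-format table
wide = pd.DataFrame({
    'name': ['John', 'Melinda'],
    'Height': [182, 160],
    'Weight': [78, 52],
})

# Wide to long, naming the new columns explicitly
long_form = pd.melt(wide,
                    id_vars=['name'],
                    value_vars=['Height', 'Weight'],
                    var_name='attribute',
                    value_name='value')

# Long back to wide: pivoting undoes the melt
round_trip = long_form.pivot(index='name', columns='attribute', values='value').reset_index()
round_trip
```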

#### Pivot Tables

- When the columns you want to pivot on have non-unique entries.
- E.g., temperature as a function of position X, Y for a given month, with multiple measurements at each X, Y.
- Want average of these measurements at each X,Y in pivoted form:

    - df.pivot_table(..., aggfunc = __)

Forest fire dataset:

Looks at temperature logged at various X, Y positions in a forest over several months.

In [None]:
forest_df = pd.read_csv('Data/forestfires.csv', usecols = ['X', 'Y', 'month', 'day', 'temp'])
inamonth_df = forest_df[(forest_df['month'] == 'mar')]

inamonth_df.head(10)

Average temperature at (X, Y) positions for March. Organized as pivot table:

In [None]:
inamonth_df.pivot_table(index = 'X', columns = 'Y', values = 'temp', aggfunc = 'mean')
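
To see why `pivot_table` is needed here, the sketch below builds a tiny made-up table (`readings` and its values are hypothetical) with two measurements at the same position. Plain `.pivot()` refuses to reshape it, while `.pivot_table()` averages the duplicates.

```python
import pandas as pd

# Hypothetical readings: position (1, 4) was measured twice
readings = pd.DataFrame({
    'X': [1, 1, 1, 2],
    'Y': [4, 4, 5, 4],
    'temp': [10.0, 14.0, 8.0, 20.0],
})

# pivot() cannot decide which of the duplicate values to keep
try:
    readings.pivot(index='X', columns='Y', values='temp')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot failed:', err)

# pivot_table() aggregates the duplicates instead
avg_temp = readings.pivot_table(index='X', columns='Y', values='temp', aggfunc='mean')
avg_temp
```

Positions with no measurement at all show up as NaN in the pivoted table.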

#### Multiindexing
- Setting multiple columns as index
- Setting hierarchies.
- Accessing data in multi-indexed DataFrames.

Airfoil noise dataset:
- Various factors affecting sound amplitude off of airplane wings.

In [None]:
colnames = ['Frequency [Hz]', 'Angle of attack [deg]',
            'Chord length [m]', 'Free-stream velocity [m/s]',
            'Suction side thickness [m]', 'Sound volume [dB]']
airfoil_df = pd.read_csv('Data/airfoil_self_noise.dat', delimiter='\t', header = None, names = colnames  )

airfoil_df.head()

Setting multiple attributes as indices can give us flexibility in addressing the data.
- How does sound amplitude depend just on frequency, stream velocity, and angle of attack?
- Create hierarchical Multiindex:

In [None]:
col_subset = ['Frequency [Hz]', 'Free-stream velocity [m/s]', 'Angle of attack [deg]', 'Sound volume [dB]']
airfoil_df = airfoil_df[col_subset].set_index(col_subset[0:3])
airfoil_df.head()

Moved columns to the index, but the index is not yet sorted, so rows are not grouped by level:
- Sorting with the .sort_index() method arranges the levels hierarchically.

In [None]:
airfoil_df = airfoil_df.sort_index()
airfoil_df.head(10)

In [None]:
# share of the summed sound volume for each angle of attack at 200 Hz and 31.7 m/s
subset_df = airfoil_df.loc[(200, 31.7)]
subset_df / subset_df.sum()

#### Accessing via the .loc accessor on multi-indices
- DataFrame.loc[first_level_index, columns]
- DataFrame.loc[(first_level_index, second_level_index, third_level_index), columns]

In [None]:
# at frequency = 1000 Hz
airfoil_df.loc[1000, :]

In [None]:
# sound vol vs angle of attack
# fixed at 1000 Hz, 55.5 m/s stream velocity
airfoil_df.loc[(1000, 55.5)]

Swapping level hierarchy:
- Look at measurement/response keeping one variable fixed and varying another.
- Swapping level hierarchy switches which we keep fixed and which we vary.


In [None]:
swapped_df = airfoil_df.swaplevel('Free-stream velocity [m/s]', 'Angle of attack [deg]').sort_index()

In [None]:
swapped_df.head()

In [None]:
swapped_df.loc[(1000, 7.3)]
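
An inner level can also be fixed without swapping, using the `.xs()` cross-section method. The sketch below runs on a small made-up multi-indexed DataFrame; `toy_df`, its level names (`freq`, `velocity`, `angle`), and its values are hypothetical stand-ins for the airfoil data.

```python
import pandas as pd

# Hypothetical three-level index: frequency, stream velocity, angle of attack
idx = pd.MultiIndex.from_tuples(
    [(1000, 31.7, 0.0), (1000, 31.7, 7.3), (1000, 55.5, 0.0), (2000, 31.7, 7.3)],
    names=['freq', 'velocity', 'angle'],
)
toy_df = pd.DataFrame({'volume': [120.0, 118.0, 125.0, 115.0]}, index=idx).sort_index()

# Fix the innermost level at 7.3 degrees; that level is dropped from the result
at_angle = toy_df.xs(7.3, level='angle')

# Same selection, but keep all three index levels
at_angle_full = toy_df.xs(7.3, level='angle', drop_level=False)
at_angle
```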

Multi-indexing opens up many possibilities for data manipulation.

Strongly encourage you to look at supplementary material and pandas documentation!

# Methods for Combining DataFrames: `.join()`, `.merge()`, `pd.concat()`

Many ways to combine dataframes! Luckily, pandas has great docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

# Concat

In [None]:
ds_chars = pd.read_csv('data/ds_chars.csv',index_col=0)
ds_chars

In [None]:
prefs = pd.read_csv('data/preferences.csv', index_col=0)
prefs

Would you concat on axis = 0 or axis = 1?

In [None]:
ds_full = pd.concat([ds_chars, prefs], axis=1)
ds_full
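
The difference between the two axes can be sketched on three tiny made-up DataFrames (`a`, `b`, `c` and their contents are hypothetical): `axis=0` stacks rows, while `axis=1` places columns side by side and aligns them on the index.

```python
import pandas as pd

# Hypothetical frames: a and b share a column, a and c share an index
a = pd.DataFrame({'x': [1, 2]}, index=['r1', 'r2'])
b = pd.DataFrame({'x': [3, 4]}, index=['r3', 'r4'])
c = pd.DataFrame({'y': [10, 20]}, index=['r1', 'r2'])

# axis=0: more rows, same columns
stacked = pd.concat([a, b], axis=0)

# axis=1: same rows, more columns (rows are matched by index label)
side_by_side = pd.concat([a, c], axis=1)
side_by_side
```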

# join & merge

Datasets do not have to have same rows or columns.
- Just a common key (or set of keys) used to match records.

pd.merge() is the most flexible workhorse function for this.

The `how` parameter in both `.join()` and `.merge()` tells pandas what sort of join to perform. We'll cover this in detail when we discuss SQL.

![image showcasing how the how parameter in a join/merge would combine the two datasets, using venn-style diagrams](https://www.datasciencemadesimple.com/wp-content/uploads/2017/09/join-or-merge-in-python-pandas-1.png)
[[Image Source]](https://www.datasciencemadesimple.com/join-merge-data-frames-pandas-python/)

In [None]:
# create two datasets
import pandas as pd
df1 = pd.DataFrame({'employee': ['Chadwick', 'Bartholemew', 'Jake', 'Brunnhilde', 'Sue', 'Jimbo Jr.'],
                    'group': ['Building' ,'Accounting', 'Engineering', 'Engineering', 'HR', 'Compliance']})

df2 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR', 'Endowment'],
                    'supervisor': ['Carly', 'Guido', 'Steve', 'Eileen']})
df3 = pd.DataFrame({'name': ['Brunnhilde', 'Bartholemew', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})


In [None]:
df1

In [None]:
df2

In [None]:
pd.merge(df1, df2, how = 'inner', on = 'group')

In [None]:
pd.merge(df1, df2, how = 'left', on = 'group')

In [None]:
pd.merge(df1, df2, how = 'right', on = 'group')
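
The remaining option, `how = 'outer'`, keeps unmatched rows from both sides. As a sketch on two small made-up tables (`left` and `right` are hypothetical), passing `indicator=True` adds a `_merge` column showing where each row came from, and the row counts show how the four join types differ.

```python
import pandas as pd

# Hypothetical tables: 'Building' has no supervisor, 'Endowment' has no employee
left = pd.DataFrame({'employee': ['Jake', 'Sue', 'Chadwick'],
                     'group': ['Engineering', 'HR', 'Building']})
right = pd.DataFrame({'group': ['Engineering', 'HR', 'Endowment'],
                      'supervisor': ['Guido', 'Steve', 'Eileen']})

# Outer join with a column recording the origin of each row
outer = pd.merge(left, right, how='outer', on='group', indicator=True)

# Number of rows produced by each join type
row_counts = {how: len(pd.merge(left, right, how=how, on='group'))
              for how in ['inner', 'left', 'right', 'outer']}
outer
```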

Merge on a key with a different label in each DataFrame:

In [None]:
df1


In [None]:
df3

In [None]:
pd.merge(df1, df3, left_on = 'employee', right_on = 'name', how = 'inner')
# what names will be included?

Can do a bit more when combining:
- pd.merge() can match on multiple columns as opposed to one.
- df1.join(df2, how = ' '): similar to merge but less flexible. Joins on the index by default, which can be faster than merging on columns.


In [None]:
df1.set_index('group').join(df2.set_index('group'), how = 'inner')
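
Matching on multiple columns works by passing a list to `on`. The sketch below uses two made-up tables (`sales`, `staff`, and their columns are hypothetical); a row is kept only when both `store` and `year` agree.

```python
import pandas as pd

# Hypothetical tables keyed by (store, year)
sales = pd.DataFrame({'store': ['A', 'A', 'B'],
                      'year': [2020, 2021, 2020],
                      'revenue': [100, 120, 90]})
staff = pd.DataFrame({'store': ['A', 'B', 'B'],
                      'year': [2020, 2020, 2021],
                      'headcount': [5, 4, 6]})

# Inner merge on the pair of key columns
merged = pd.merge(sales, staff, on=['store', 'year'], how='inner')
merged
```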

Data in real life can be messy:

- Often keys have misspellings or don't exactly match up
- Determine whether key is similar enough.
- Then link record if true.
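
One possible way to sketch this idea uses `difflib.get_close_matches` from the Python standard library; this is an illustration under assumptions, not a recommended production approach. The tables, the helper `closest_name`, and the 0.8 similarity cutoff are all hypothetical choices.

```python
import difflib
import pandas as pd

# Hypothetical tables whose keys are spelled slightly differently
employees = pd.DataFrame({'employee': ['Bartholemew', 'Brunnhilde', 'Sue']})
hires = pd.DataFrame({'name': ['Bartholomew', 'Brunhilde', 'Susan'],
                      'hire_date': [2008, 2004, 2014]})

def closest_name(name, candidates, cutoff=0.8):
    """Return the most similar candidate, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Step 1: decide whether each key has a similar enough partner
candidate_names = hires['name'].tolist()
employees['matched_name'] = employees['employee'].apply(
    lambda name: closest_name(name, candidate_names)
)

# Step 2: link the records only where a match was found
linked = pd.merge(employees, hires, left_on='matched_name', right_on='name', how='left')
linked
```

The cutoff controls the trade-off: a lower value links more records but risks false matches, so the linked rows are worth inspecting by hand.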