In [None]:
import pandas as pd

# Merging and Joining DataFrames
When a project draws data from several sources, we often need to combine those datasets into a single DataFrame.

## Merge method
Documentation: [Pandas.DataFrame.merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)

Merging combines two distinct DataFrames by matching rows on one or more key columns. The `how` parameter of `merge` accepts five join types (`inner`, `outer`, `left`, `right`, and `cross`). We will learn about four of them: inner, outer, left, and right.

<img src=https://miro.medium.com/max/3600/1*9eH1_7VbTZPZd9jBiGIyNA.png width=50% height=50%>
# DataFrame 1
<img src=https://drive.google.com/uc?id=1d-8a4zcx34EKUE-bHN_eYyFksldygJtA>
# DataFrame 2
<img src=https://drive.google.com/uc?id=1pt2DhR1VqixV8H6uFUm8vyOXrg4-P-_T>
# Inner Merge
<img src=https://drive.google.com/uc?id=1wGKCyUnH69U6GztQ1JS7AwWiL-KmnLMH>
# Left Merge
<img src=https://drive.google.com/uc?id=1trPZoVqGmqnute1xsOro1EcxVQcCuFy6>
# Right Merge
<img src=https://drive.google.com/uc?id=1202Zym48VEx6w5tL02_W2fhalS1CV27Z>
# Outer Merge
<img src=https://drive.google.com/uc?id=1g49zV0lpmk3gvy9eQfWd_p-zMiY8FU1l>
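The four merge types pictured above can also be compared by row count. The sketch below uses two small hypothetical frames (`left` and `right`, with an assumed `key` column), not the DataFrames built later in this notebook. It also passes `indicator=True`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical toy frames for illustration only
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# Number of rows produced by each join type
counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    counts[how] = len(left.merge(right, how=how, on='key'))
print(counts)

# indicator=True adds a `_merge` column with the values
# 'left_only', 'right_only', or 'both'
outer = left.merge(right, how='outer', on='key', indicator=True)
print(outer)
```

Only keys `b` and `c` appear in both frames, so the inner merge keeps two rows while the outer merge keeps all four distinct keys and fills the missing side with `NaN`.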

In [None]:
# Create 2 DataFrames that will be merged
data_list = [ ['Larry',2,3,4], ['Matt',6,7,8], ['Kass', 9, 10, 11], ['Ben', 12, 13,  14] ]
data_dict = {
             'name':['Larry','Matt','Kass', 'Molly'],
             'apples':[14, 15, 16, 17],
             'book_id': [123, 456, 789, 101],
            }
df1 = pd.DataFrame(data_list, columns=['name','bubbles', 'songs', 'balls'])
df2 = pd.DataFrame(data_dict)

### Merge method with inner Joins

In [None]:
print(df1)
print('\n')
print(df2)

In [None]:
df1.merge(df2, how='inner', on='name')

### Merge method with left Joins

In [None]:
df1.merge(df2, how='left', on='name')

### Merge method with right Joins

In [None]:
df1.merge(df2, how='right', on='name')

### Merge method with outer Joins

In [None]:
df1.merge(df2, how='outer', on='name')

### Multiple merges

In [None]:
df3 = pd.DataFrame({'book_id': [121, 456, 789, 101, 123],
                    'book_name': ['Applied Text Analysis with Python',
                                  "Pandas 1.x Cookbook", "Think Stats",
                                  "Think Python", "Visual Analytics for Machine Learning"]})

In [None]:
df3.merge(df1.merge(df2, how='inner', on='name'), how='left', on='book_id')

# Aggregating Data - DataFrame Groupby
Documentation: [Pandas.DataFrame.Groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html?highlight=groupby#pandas.DataFrame.groupby)

- `books.groupby(['category', 'rank'])`: grouping with a list of columns

- `books.groupby('category')`: grouping with a single column

Both calls return a GroupBy object.
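To see what the GroupBy object holds before any aggregation, here is a minimal sketch. The `toy` frame is a hypothetical stand-in for `books`, with made-up values, so it runs without the data file.

```python
import pandas as pd

# Hypothetical stand-in for the `books` table; values are made up
toy = pd.DataFrame({
    'category': ['fiction', 'fiction', 'science', 'science', 'history'],
    'rank': [1, 2, 1, 1, 3],
    'sales': [100, 150, 80, 120, 60],
})

# Grouping by a single column
grouped = toy.groupby('category')
print(type(grouped).__name__)  # class name of the returned object
print(grouped.ngroups)         # number of distinct categories
print(grouped.groups)          # maps each group label to its row labels

# Grouping by a list of columns: one group per unique combination
multi = toy.groupby(['category', 'rank'])
print(multi.ngroups)
```

Nothing is computed until an aggregation (or another method) is called on the object. The grouping itself only records which rows belong to which group.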

The most common use of the `.groupby` method is to perform an aggregation.
An aggregation summarizes or combines a sequence of many input values into a single output value.


Aggregation has two components:
1. The aggregating columns: the values that will be aggregated
2. The aggregation functions: the operations applied to those values

Example Aggregating functions:

`mean` - Taking the mean of a column

`sum` - summing up all the values of a column

`min` - find minimum value in column

`max` - find max value in column

`count` - compute count of group, excluding missing values.

`var` - find variance of column values

`std` - find standard deviation of column values

`size` - find total number of rows within a group, including missing values
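As a quick illustration of the functions listed above, the sketch below applies all of them to one column at once by passing a list of names to `.agg`. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    'category': ['fiction', 'fiction', 'science', 'science', 'history'],
    'sales': [100, 150, 80, 120, 60],
})

# Each string names one aggregating function; the result has one
# column per function and one row per category
summary = toy.groupby('category')['sales'].agg(
    ['mean', 'sum', 'min', 'max', 'count', 'var', 'std', 'size']
)
print(summary)
```

Note that `var` and `std` use the sample formulas by default, so a group with a single row (here `history`) gets `NaN` for both.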

In [None]:
unames = ['category', 'book_id', 'book_name', 'rank', 'sales', 'type', 'sold_out', 'best_seller']
books = pd.read_table('book_categories.txt', sep='--', names = unames, engine='python', header=0)

## Grouping with a single column

In [None]:
# Passing a dictionary to .agg returns a DataFrame
books.groupby('category').agg({'sales':'mean'})

In [None]:
# Sort mean sales from high to low
books.groupby('category').agg({'sales':'mean'}).sort_values('sales', ascending=False)

In [None]:
# Selecting the aggregating column with the index operator and then passing
# the aggregating function as a string to .agg returns a Series
books.groupby('category')['sales'].agg('mean')

In [None]:
books.groupby('category')['sales'].mean()

## Grouping and aggregating with multiple columns and functions
First identify the grouping columns, the aggregating columns, and the aggregating functions.

Answer the following questions:
1. Find the sum of best sellers for every category per type.
2. Find the sum and mean of sold out and best seller books for every category per type.
3. For each category and rank, find the total number (size) of books, the number and mean of best sellers, and the mean and variance of the sales.

In [None]:
# 1. Find the sum of best sellers for every category per type
books.groupby(['category', 'type'])[['best_seller']].agg('sum')

In [None]:
books.groupby(['category', 'type'])[['best_seller']].agg('sum').sort_values('best_seller', ascending=False)

To group by multiple columns as in step 1, we pass a list of the column names to the `.groupby`
method. Each unique combination of category and type forms its own group. Within
each of these groups, the sum of the best sellers is calculated.
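The same pattern can be tried on a few hand-made rows. This sketch assumes a hypothetical `toy` frame that reuses the `category`, `type`, and `best_seller` column names. The values are invented for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the `books` table
toy = pd.DataFrame({
    'category': ['fiction', 'fiction', 'fiction', 'science', 'science'],
    'type': ['ebook', 'ebook', 'print', 'print', 'print'],
    'best_seller': [1, 0, 1, 1, 1],
})

# One row per unique (category, type) combination
result = toy.groupby(['category', 'type'])[['best_seller']].agg('sum')
print(result)

# The grouping columns become a two-level index, so a single group
# is selected with a tuple
print(result.loc[('science', 'print'), 'best_seller'])

# reset_index() turns the grouping columns back into ordinary columns
print(result.reset_index())
```

Combinations that never occur in the data, such as `('science', 'ebook')` here, do not appear in the result.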

In [None]:
# 2. Find the sum and mean of sold out and best seller books for every category per type
books.groupby(['category', 'type'])[['sold_out', 'best_seller']].agg(['sum', 'mean'])


Step 2 groups by both category and type, but this time aggregates two columns.
It applies each of the two aggregation functions, using the strings `sum` and `mean`, to each
column, resulting in four returned columns per group.
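The four returned columns are labeled with two levels: the aggregating column and the function. The sketch below shows this on a hypothetical `toy` frame with invented values. It also shows one possible way to flatten the labels into plain strings, which is a convenience and not something the steps above require.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the `books` table
toy = pd.DataFrame({
    'category': ['fiction', 'fiction', 'science', 'science'],
    'type': ['ebook', 'ebook', 'print', 'print'],
    'sold_out': [1, 0, 0, 0],
    'best_seller': [1, 1, 0, 1],
})

result = toy.groupby(['category', 'type'])[['sold_out', 'best_seller']].agg(['sum', 'mean'])

# Each column label is a (column, function) tuple
print(result.columns.tolist())

# Optional: join the two levels into single strings such as 'sold_out_sum'
flat = result.copy()
flat.columns = ['_'.join(col) for col in flat.columns]
print(flat)
```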

In [None]:
# 3. For each category and rank, find the total number (size) of books, the number and mean of best sellers and the mean and variance of the sales
books.groupby(['category', 'rank']).agg({'best_seller':['size', 'sum', 'mean'],
                                        'sales':['mean','var']})

Step 3 goes even further, and uses a dictionary to map specific aggregating columns to
different aggregating functions. Notice that the `size` aggregating function returns the total
number of rows per group. This is different from the `count` aggregating function, which
returns the number of non-missing values per group.
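The difference between `size` and `count` only shows up when a group contains missing values. A minimal sketch, using a hypothetical `toy` frame with one `NaN` in `sales`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing sales value in the fiction group
toy = pd.DataFrame({
    'category': ['fiction', 'fiction', 'fiction', 'science'],
    'sales': [100.0, np.nan, 150.0, 80.0],
})

# size counts every row in the group; count skips the NaN
summary = toy.groupby('category')['sales'].agg(['size', 'count'])
print(summary)
```

Here the fiction group has three rows but only two non-missing sales values, so its `size` and `count` differ. For the science group they are equal.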

In [None]:
# Sort by mean sales, high to low
books.groupby(['category', 'rank']).agg({'best_seller':['size', 'sum', 'mean'],
                                        'sales':['mean','var']}).sort_values(('sales', 'mean'), ascending=False)

# Pair Programming
1. Merge the `df_customer` and `df_info` DataFrames by `id` with a right join and assign the resulting DataFrame to the variable name `combined`.
2. Aggregate the `combined` DataFrame. Group by the `sex` column and aggregate `age` and `weight` by mean, then sort by weight.

In [None]:
df_customer = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})

df_info = pd.DataFrame({
    'id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F'],
    'weight': [125, 185, 220, 130]
})