# Combining Datasets in Pandas: Concat and Append

- Pandas provides functions such as `pd.concat` that make combining Series and DataFrames efficient and easy.

# 1. pd.concat
- On Series
- On DF

In [2]:
import pandas as pd 
import numpy as np

In [18]:
# Helper to quickly generate a DataFrame

from typing import Any, Iterable

def make_df(cols: Iterable[Any], ind: Iterable[Any]) -> pd.DataFrame:
    """
    Quickly make a DataFrame whose cells combine the column label
    and the index label (e.g. 'A0').
    """
    ind = list(ind)  # materialize so the index can be reused for every column
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, index=ind)

In [19]:
make_df('ABC',range(3))

    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2


## 1.1 Recall: Concatenation in numpy

- In numpy we use:
  - np.concatenate([arr1, arr2], axis=0)
  - np.vstack
  - np.hstack

- Similarly, pandas has pd.concat(), which adds options for handling indices and mismatched columns.
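As a quick refresher, the NumPy calls listed above can be sketched as follows. The arrays `x` and `y` are illustrative names, not part of the original notebook.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

# np.concatenate takes a sequence of arrays; axis=0 stacks them row-wise
rows = np.concatenate([x, y], axis=0)   # shape (4, 2)

# axis=1 joins them column-wise
cols = np.concatenate([x, y], axis=1)   # shape (2, 4)

# For 2-D input, vstack and hstack are shortcuts for the two cases above
same_as_rows = np.vstack([x, y])
same_as_cols = np.hstack([x, y])
```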

## 1.2 Simple Concatenation with pd.concat

- By default, `axis=0`, so the objects are stacked row-wise.

``` python
# Note: the join_axes argument was removed in pandas 1.0
pd.concat(objs, axis=0, join='outer', ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          sort=False, copy=True)
```

In [20]:
# 1. pd.concat() on Series

ser1 = pd.Series(['A','B','C'], index= [1,2,3])
ser1

1    A
2    B
3    C
dtype: object

In [21]:
ser2 = pd.Series(['D', 'E', 'F'], index= [4,5,6])
ser2

4    D
5    E
6    F
dtype: object

In [22]:
pd.concat([ser1, ser2])  # default axis=0 (rows)

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object

In [23]:
pd.concat([ser2, ser1])  # Swap the order of the Series

4    D
5    E
6    F
1    A
2    B
3    C
dtype: object

- Specifying `axis` matters mainly for higher-dimensional objects such as DataFrames.
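A minimal sketch of what `axis=1` does even with Series, assuming two Series that share an index. The names `left` and `right` are illustrative and differ from `ser1`/`ser2` above, which have disjoint indices.

```python
import pandas as pd

left = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
right = pd.Series(['D', 'E', 'F'], index=[1, 2, 3])

# axis=1 places each Series in its own column, aligned on the shared index,
# so the result is a DataFrame rather than a longer Series
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```

Because the Series here are unnamed, the resulting columns are labeled with integers.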

In [24]:
# 2. pd.concat() on DataFrames

# Create two DataFrames with the same columns
df1 = make_df('AB', [1,2])
df1

    A   B
1  A1  B1
2  A2  B2


In [31]:
df2 = make_df('AB', [3,4])
df2

    A   B
3  A3  B3
4  A4  B4


In [32]:
pd.concat([df1,df2], axis=0)

    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4


In [37]:
# Create two DataFrames with the same index
df1 = make_df('AB', [1,2])

df2 = make_df('CD', [1,2])

pd.concat([df1, df2], axis=1)   # Equivalent: axis='columns'

    A   B   C   D
1  A1  B1  C1  D1
2  A2  B2  C2  D2


In [38]:
pd.concat([df1,df2])

     A    B    C    D
1   A1   B1  NaN  NaN
2   A2   B2  NaN  NaN
1  NaN  NaN   C1   D1
2  NaN  NaN   C2   D2


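- Note that with the default axis=0 the rows are stacked: columns missing from one DataFrame are filled with NaN, and the repeated index labels 1 and 2 are kept as they are.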

## 1.3 Duplicate Indices

- Sometimes we need to handle duplicate indices, as seen in the example above.
- Pandas offers three ways to deal with them.

1. Verify No Overlapping Indices
   - Use verify_integrity=True to raise an exception if duplicate indices exist.
  
2. Ignore Index
   - Use ignore_index=True to reset the index and avoid duplication.
  
3. Add MultiIndex Keys
   - Use keys to create a hierarchical index.

In [47]:
# Create duplicate index data

df1 = make_df('AB', [0,1])
df1

    A   B
0  A0  B0
1  A1  B1


In [48]:
df2 = make_df('AB', [0,1])
df2

    A   B
0  A0  B0
1  A1  B1


In [49]:
pd.concat([df1, df2])  # Duplication of index

    A   B
0  A0  B0
1  A1  B1
0  A0  B0
1  A1  B1


In [50]:
# Remove duplication

# 1. Raise an Exception
try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError as e:
    print(e)

Indexes have overlapping values: Index([0, 1], dtype='int64')


- With `verify_integrity=True`, `pd.concat` raises a `ValueError` when the indices overlap; it does not remove the duplicates.

In [52]:
# 2. Ignore index and reassign new ones

pd.concat([df1,df2], ignore_index=True)

    A   B
0  A0  B0
1  A1  B1
2  A0  B0
3  A1  B1


- The original indices are discarded and a new integer index (0 to n-1) is assigned.

In [53]:
# 3. Build a MultiIndex by giving each DataFrame a key
pd.concat([df1,df2], keys=['df1', 'df2'])

        A   B
df1 0  A0  B0
    1  A1  B1
df2 0  A0  B0
    1  A1  B1
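With `keys`, each source frame can be pulled back out by its outer index label. A minimal sketch, redefining the frames inline so it runs on its own; the level names `'source'` and `'row'` passed to `names` are illustrative choices, not from the original notebook.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])

# keys become the outer index level; names labels the two levels
combined = pd.concat([df1, df2], keys=['df1', 'df2'], names=['source', 'row'])

# Select every row that came from df2 via the outer level
from_df2 = combined.loc['df2']

# Select a single row with a (source, row) tuple
one_row = combined.loc[('df1', 1)]
```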


## 1.4  Concatenation with joins

- When concatenating two DataFrames whose columns only partly match, pandas lets us choose how to combine them with the `join` parameter:
  1. join='outer'
    - Union of columns
    - Default behavior.
    - Includes all columns from both DataFrames.
    - Missing values are filled with NaN.

  2. join='inner'
    - Intersection of columns
    - Includes only columns that are present in both DataFrames.

In [56]:
# Create sample DataFrames with mismatched columns
df1 = make_df('ABC', [1,2,3])
df2 = make_df('BCD', [4,5,6])

In [57]:
df1

    A   B   C
1  A1  B1  C1
2  A2  B2  C2
3  A3  B3  C3


In [58]:
df2

    B   C   D
4  B4  C4  D4
5  B5  C5  D5
6  B6  C6  D6


In [60]:
# Simple concat
pd.concat([df1,df2])  # Default: outer join

     A   B   C    D
1   A1  B1  C1  NaN
2   A2  B2  C2  NaN
3   A3  B3  C3  NaN
4  NaN  B4  C4   D4
5  NaN  B5  C5   D5
6  NaN  B6  C6   D6


In [61]:
# Outer join
pd.concat([df1,df2], join='outer')  # All columns of all DF

     A   B   C    D
1   A1  B1  C1  NaN
2   A2  B2  C2  NaN
3   A3  B3  C3  NaN
4  NaN  B4  C4   D4
5  NaN  B5  C5   D5
6  NaN  B6  C6   D6


In [62]:
# inner join

pd.concat([df1,df2], join='inner')  # only common columns

    B   C
1  B1  C1
2  B2  C2
3  B3  C3
4  B4  C4
5  B5  C5
6  B6  C6
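The `join` option always applies to the axis that is not being concatenated. The examples above stack rows, so `join` decides which columns survive; with `axis=1` it decides which index labels survive instead. A small sketch under that assumption, using illustrative frames `left` and `right` that are not part of the original notebook:

```python
import pandas as pd

left = pd.DataFrame({'A': ['A1', 'A2', 'A3']}, index=[1, 2, 3])
right = pd.DataFrame({'B': ['B2', 'B3', 'B4']}, index=[2, 3, 4])

# Outer join (default): union of index labels, NaN where a frame has no row
outer = pd.concat([left, right], axis=1)

# Inner join: only index labels present in both frames
inner = pd.concat([left, right], axis=1, join='inner')
```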


## 1.5 The append() method 

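- Deprecated: `DataFrame.append()` and `Series.append()` were deprecated in pandas 1.4 and removed in pandas 2.0. Use `pd.concat` instead.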
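The usual replacement for `append()` is to pass the objects to `pd.concat` as a list. A minimal sketch, with the frames redefined inline so it runs on its own; the variable names `pieces` and `all_rows` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
df2 = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4])

# Old style, no longer available in pandas 2.x: df1.append(df2)
combined = pd.concat([df1, df2])

# When adding many pieces, collect them in a list and concatenate once,
# rather than growing a DataFrame step by step inside a loop
pieces = [df1, df2]
all_rows = pd.concat(pieces, ignore_index=True)
```

Concatenating once at the end avoids creating a new copy of the accumulated data on every step.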