Topic 4: **Data Integration**

Data integration combines data from multiple sources into a single dataset for analysis or modeling. It is often necessary because real-world data is spread across different sources, such as databases, files, and APIs. This section shows how to integrate multiple datasets in Python using pandas:

### Combining Multiple Datasets

#### a. Concatenation

Concatenation combines datasets along one axis: row-wise (`axis=0`) stacks them on top of each other, and column-wise (`axis=1`) places them side by side. Row-wise concatenation suits datasets that share the same columns, while column-wise concatenation aligns rows by their index labels.

In [1]:
import pandas as pd

# Example datasets
data1 = pd.DataFrame({'A': [1, 2, 3]})
data2 = pd.DataFrame({'A': [4, 5, 6]})

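# Concatenating datasets along rows (original index labels are kept, so 0-2 repeat)
combined_data_row = pd.concat([data1, data2], axis=0)

# Concatenating datasets along columns (rows are aligned by index label;
# both columns keep the name 'A')
combined_data_col = pd.concat([data1, data2], axis=1)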

print("Combined data along rows:")
print(combined_data_row)
print("\nCombined data along columns:")
print(combined_data_col)

Combined data along rows:
   A
0  1
1  2
2  3
0  4
1  5
2  6

Combined data along columns:
   A  A
0  1  4
1  2  5
2  3  6
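
The output above repeats the row labels 0-2 and produces two columns both named `A`. The following is a minimal sketch of three common ways to avoid that, reusing the same `data1` and `data2`; the variable names `stacked`, `labeled`, and `side_by_side` and the new column name `B` are chosen here for illustration only.

```python
import pandas as pd

data1 = pd.DataFrame({'A': [1, 2, 3]})
data2 = pd.DataFrame({'A': [4, 5, 6]})

# ignore_index=True discards the original row labels and renumbers from 0
stacked = pd.concat([data1, data2], ignore_index=True)

# keys= tags each input, producing a two-level (MultiIndex) row index
labeled = pd.concat([data1, data2], keys=['first', 'second'])

# Renaming before a column-wise concat avoids two columns both called 'A'
# ('B' is an illustrative name, not part of the original example)
side_by_side = pd.concat([data1, data2.rename(columns={'A': 'B'})], axis=1)

print(stacked.index.tolist())
print(labeled.loc['second'])
print(side_by_side.columns.tolist())
```

With `keys=`, the rows from each source remain addressable by their tag (for example `labeled.loc['second']`), which can help when the origin of each row matters later.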


#### b. Merging

Merging combines datasets on one or more shared key columns (or on their indices). It is useful when datasets have different columns but share a common identifier. The `how` argument controls which keys are kept: `inner` (the default), `left`, `right`, or `outer`.

In [5]:
import pandas as pd

# Example datasets
data1 = pd.DataFrame({'ID': [1, 2, 3],
                      'Value1': [10, 20, 30]})
data2 = pd.DataFrame({'ID': [2, 3, 4],
                      'Value2': [40, 50, 60]})

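# Right merge on 'ID': keep every row of data2; IDs missing from data1
# get NaN in Value1 (which is why Value1 becomes a float column)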
merged_data = pd.merge(data1, data2, on='ID', how='right')

print("Merged data:")
print(merged_data)

Merged data:
   ID  Value1  Value2
0   2    20.0      40
1   3    30.0      50
2   4     NaN      60
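
To see how the choice of `how` changes the result, the sketch below runs the same merge with each of the four join types and then uses `indicator=True` to show where each row came from. The names `results` and `audit` are illustrative; the data is the same as in the cell above.

```python
import pandas as pd

data1 = pd.DataFrame({'ID': [1, 2, 3], 'Value1': [10, 20, 30]})
data2 = pd.DataFrame({'ID': [2, 3, 4], 'Value2': [40, 50, 60]})

# Run the same merge with each join type and record which IDs survive
results = {}
for how in ['inner', 'left', 'right', 'outer']:
    results[how] = pd.merge(data1, data2, on='ID', how=how)
    print(how, sorted(results[how]['ID'].tolist()))

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
audit = pd.merge(data1, data2, on='ID', how='outer', indicator=True)
print(audit[['ID', '_merge']])
```

Here `inner` keeps only IDs 2 and 3, `left` keeps 1 to 3, `right` keeps 2 to 4, and `outer` keeps all four. The `_merge` column is a quick way to check for unmatched keys before deciding which join type fits the analysis.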


#### c. Joining

Joining is similar to merging, but `DataFrame.join` aligns the datasets on their indices by default rather than on a column.

In [6]:
import pandas as pd

# Example datasets
data1 = pd.DataFrame({'Value1': [10, 20, 30]}, index=['A', 'B', 'C'])
data2 = pd.DataFrame({'Value2': [40, 50, 60]}, index=['B', 'C', 'D'])

# Left join on the index: keep every row of data1; labels missing from
# data2 get NaN in Value2
joined_data = data1.join(data2, how='left')

print("Joined data:")
print(joined_data)

Joined data:
   Value1  Value2
A      10     NaN
B      20    40.0
C      30    50.0
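
To make the relationship between joining and merging concrete, the sketch below reproduces the same left join with `pd.merge` using `left_index=True` and `right_index=True`, then shows an outer join that keeps every label from both datasets. The names `joined`, `merged`, and `outer_joined` are illustrative.

```python
import pandas as pd

data1 = pd.DataFrame({'Value1': [10, 20, 30]}, index=['A', 'B', 'C'])
data2 = pd.DataFrame({'Value2': [40, 50, 60]}, index=['B', 'C', 'D'])

# join aligns on the index by default
joined = data1.join(data2, how='left')

# The same operation written as a merge on both indices
merged = pd.merge(data1, data2, left_index=True, right_index=True, how='left')
print(joined.equals(merged))

# An outer join keeps every index label found in either dataset
outer_joined = data1.join(data2, how='outer')
print(outer_joined)
```

In practice, `join` is a convenient shorthand when the keys already live in the index, while `merge` is the more general tool when the keys are ordinary columns.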
