## Merging Multiple Datasets

We can merge datasets when:
* we have multiple small datasets
* we need a larger dataset for deep learning


### Example

#### Problem Statement:
Predict a person's weight from their height, using two height and weight datasets.

#### Datasets:
https://www.kaggle.com/mustafaali96/weight-height

https://www.kaggle.com/burnoutminer/heights-and-weights-dataset

### Things to keep in mind

*   Number of features (tabular data) or characteristics (image data) in each dataset
*   Unit of each feature value, e.g. height in inches, centimetres, metres, or feet
*   Distribution of the data: is each dataset balanced?

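The unit check above can be sketched in code. This is a minimal illustration, assuming a hypothetical third dataset recorded in centimetres and kilograms; the column names `Height(cm)` and `Weight(kg)`, the sample values, and the helper `to_inches_pounds` are made up for this example and are not part of the datasets used below.

```python
import pandas as pd

# Conversion factors to the target units used in this notebook (inches, pounds).
CM_PER_INCH = 2.54
LB_PER_KG = 2.2046226218


def to_inches_pounds(df, height_col, weight_col):
    """Return a new frame with height in inches and weight in pounds.

    Assumes the input columns hold centimetres and kilograms.
    """
    return pd.DataFrame({
        "Height(Inches)": df[height_col] / CM_PER_INCH,
        "Weight(Pounds)": df[weight_col] * LB_PER_KG,
    })


# Hypothetical metric dataset, made up for illustration only.
metric_df = pd.DataFrame({"Height(cm)": [170.0, 180.0], "Weight(kg)": [65.0, 80.0]})
converted = to_inches_pounds(metric_df, "Height(cm)", "Weight(kg)")
print(converted.round(2))
```

Converting every source to one unit system before concatenating avoids silently mixing, say, centimetres and inches in the same column.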
### Let's Code

In [None]:
import pandas as pd

#### Dataset1

In [None]:
url1 = "https://raw.githubusercontent.com/datamagic2020/SampleDataStore/main/height_weight_dataset1.csv"
dataset1 = pd.read_csv(url1)

In [None]:
dataset1.head()

Unnamed: 0,Index,Height(Inches),Weight(Pounds)
0,1,65.78331,112.9925
1,2,71.51521,136.4873
2,3,69.39874,153.0269
3,4,68.2166,142.3354
4,5,67.78781,144.2971


#### Dataset2

In [None]:
url2 = "https://raw.githubusercontent.com/datamagic2020/SampleDataStore/main/height_weight_dataset2.csv"
dataset2 = pd.read_csv(url2)

In [None]:
print("dataset1:")
display(dataset1.head())
print("dataset2:")
display(dataset2.head())


dataset1:


Unnamed: 0,Index,Height(Inches),Weight(Pounds)
0,1,65.78331,112.9925
1,2,71.51521,136.4873
2,3,69.39874,153.0269
3,4,68.2166,142.3354
4,5,67.78781,144.2971


dataset2:


Unnamed: 0,Gender,Height,Weight
0,Male,73.847017,241.893563
1,Male,68.781904,162.310473
2,Male,74.110105,212.740856
3,Male,71.730978,220.04247
4,Male,69.881796,206.349801


#### Handle features

Drop the features/columns that are not required

In [None]:
display(dataset1.head())
# 'Index' is only a row counter, not a feature, so remove it
dataset1.drop(columns=['Index'], inplace=True)
display(dataset1.head())

Unnamed: 0,Index,Height(Inches),Weight(Pounds)
0,1,65.78331,112.9925
1,2,71.51521,136.4873
2,3,69.39874,153.0269
3,4,68.2166,142.3354
4,5,67.78781,144.2971


Unnamed: 0,Height(Inches),Weight(Pounds)
0,65.78331,112.9925
1,71.51521,136.4873
2,69.39874,153.0269
3,68.2166,142.3354
4,67.78781,144.2971


In [None]:
display(dataset2.head())
# dataset1 has no 'Gender' column, so drop it here to keep the same features in both datasets
dataset2.drop(columns=['Gender'], inplace=True)
display(dataset2.head())

Unnamed: 0,Gender,Height,Weight
0,Male,73.847017,241.893563
1,Male,68.781904,162.310473
2,Male,74.110105,212.740856
3,Male,71.730978,220.04247
4,Male,69.881796,206.349801


Unnamed: 0,Height,Weight
0,73.847017,241.893563
1,68.781904,162.310473
2,74.110105,212.740856
3,71.730978,220.04247
4,69.881796,206.349801


Rename features/columns

In [None]:
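# Make the feature names identical across both datasets.
# The values in dataset2 already look like inches and pounds, so only the names change.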
dataset2.rename(columns = {'Height':'Height(Inches)','Weight':'Weight(Pounds)'}, inplace = True)
dataset2.head()

Unnamed: 0,Height(Inches),Weight(Pounds)
0,73.847017,241.893563
1,68.781904,162.310473
2,74.110105,212.740856
3,71.730978,220.04247
4,69.881796,206.349801


Total records in each dataset

In [None]:
display(dataset1.shape,dataset2.shape)

(25000, 2)

(10000, 2)

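Before merging, it can help to compare the distributions of the two datasets, since a large gap means the merged data mixes different populations. The sketch below is only an illustration: `d1` and `d2` are stand-ins built from the first five rows displayed earlier, so the numbers are not representative of the full data. In the notebook itself, the same comparison could be run on `dataset1` and `dataset2`.

```python
import pandas as pd

# Stand-ins built from the first five rows shown earlier (not the full datasets).
d1 = pd.DataFrame({
    "Height(Inches)": [65.78331, 71.51521, 69.39874, 68.2166, 67.78781],
    "Weight(Pounds)": [112.9925, 136.4873, 153.0269, 142.3354, 144.2971],
})
d2 = pd.DataFrame({
    "Height(Inches)": [73.847017, 68.781904, 74.110105, 71.730978, 69.881796],
    "Weight(Pounds)": [241.893563, 162.310473, 212.740856, 220.04247, 206.349801],
})

# Put the per-feature means side by side and compute the gap between them.
comparison = pd.DataFrame({"dataset1": d1.mean(), "dataset2": d2.mean()})
comparison["difference"] = comparison["dataset2"] - comparison["dataset1"]
print(comparison.round(2))
```

In these sample rows the weights in the second dataset are noticeably higher, which is the kind of shift worth investigating before the two sources are combined.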
#### Merge both Datasets

In [None]:
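# ignore_index=True gives the merged rows a fresh 0..n-1 index instead of repeated labels
merged_dataset = pd.concat([dataset1, dataset2], ignore_index=True)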
merged_dataset.shape

(35000, 2)

In [None]:
merged_dataset.head()

Unnamed: 0,Height(Inches),Weight(Pounds)
0,65.78331,112.9925
1,71.51521,136.4873
2,69.39874,153.0269
3,68.2166,142.3354
4,67.78781,144.2971

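Once the rows are concatenated, nothing records which dataset each row came from. One possible approach, shown here as a sketch rather than the notebook's method, is to tag each frame with a provenance column before merging. The `Source` column name is an assumption introduced for this example, and `d1` and `d2` are stand-ins built from the first five rows displayed earlier.

```python
import pandas as pd

# Stand-ins built from the first five rows shown earlier (not the full datasets).
d1 = pd.DataFrame({
    "Height(Inches)": [65.78331, 71.51521, 69.39874, 68.2166, 67.78781],
    "Weight(Pounds)": [112.9925, 136.4873, 153.0269, 142.3354, 144.2971],
})
d2 = pd.DataFrame({
    "Height(Inches)": [73.847017, 68.781904, 74.110105, 71.730978, 69.881796],
    "Weight(Pounds)": [241.893563, 162.310473, 212.740856, 220.04247, 206.349801],
})

# Without ignore_index, each frame keeps its own row labels, so labels repeat.
plain = pd.concat([d1, d2])
print("unique index without ignore_index:", plain.index.is_unique)

# Tag each row with its origin ('Source' is a hypothetical column name),
# then concatenate with a fresh 0..n-1 index.
labelled = pd.concat(
    [d1.assign(Source="dataset1"), d2.assign(Source="dataset2")],
    ignore_index=True,
)
print("unique index with ignore_index:", labelled.index.is_unique)
print(labelled["Source"].value_counts())
```

Keeping the `Source` column makes it possible to filter or re-weight by origin later; it should be dropped before training if it is not meant to be a feature.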

### Alternatives

*   Data augmentation
*   Use another dataset as test or validation data

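Both alternatives can be sketched briefly. The example below is an illustration under stated assumptions, not a recommended pipeline: `d1` and `d2` are stand-ins built from the first five rows displayed earlier, the model is a straight-line fit with `np.polyfit`, and the noise scales used for augmentation are arbitrary values chosen for this example.

```python
import numpy as np
import pandas as pd

# Stand-ins built from the first five rows shown earlier (not the full datasets).
d1 = pd.DataFrame({
    "Height(Inches)": [65.78331, 71.51521, 69.39874, 68.2166, 67.78781],
    "Weight(Pounds)": [112.9925, 136.4873, 153.0269, 142.3354, 144.2971],
})
d2 = pd.DataFrame({
    "Height(Inches)": [73.847017, 68.781904, 74.110105, 71.730978, 69.881796],
    "Weight(Pounds)": [241.893563, 162.310473, 212.740856, 220.04247, 206.349801],
})

# Alternative 1: fit on the first dataset only and keep the second as validation data.
slope, intercept = np.polyfit(d1["Height(Inches)"], d1["Weight(Pounds)"], deg=1)
predicted = slope * d2["Height(Inches)"] + intercept
validation_mae = (predicted - d2["Weight(Pounds)"]).abs().mean()
print("validation MAE (pounds):", round(validation_mae, 2))

# Alternative 2: augment the first dataset by adding small Gaussian noise ("jitter").
# The scales (0.25 inch, 1.0 pound) are arbitrary assumptions for illustration.
rng = np.random.default_rng(seed=0)
noise = rng.normal(loc=0.0, scale=[0.25, 1.0], size=d1.shape)
augmented = pd.concat([d1, d1 + noise], ignore_index=True)
print("rows before/after augmentation:", len(d1), len(augmented))
```

A large validation error on the second dataset is a sign that the two sources follow different distributions, in which case merging them may hurt more than it helps.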