# Merging Data with Pandas

In data science we often need to combine tables. pandas provides the `merge()` function for this: it takes two `DataFrame` objects and the columns on which to join them. In principle it works much like a SQL join 🔭
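
To make the SQL analogy concrete before loading the real files, here is a minimal sketch on two toy tables. The `customers` and `orders` frames and their columns are hypothetical and are not part of the tutorial's datasets.

```python
import pandas as pd

# Hypothetical toy tables, not the tutorial's CSV files.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ana", "Ben", "Cleo"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3, 4],
                       "amount": [20.0, 35.5, 12.0, 99.0]})

# Default how="inner": keep only the customer_id values found in both tables,
# roughly SELECT * FROM customers JOIN orders USING (customer_id).
joined = pd.merge(customers, orders, on="customer_id")
print(joined)
```

Customer 2 has no order and order 4 has no customer, so neither appears in `joined`.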


### Import pandas and the datasets `user_device.csv`, `user_usage.csv` and `android_devices.csv`

In [1]:
import pandas as pd
user_device = pd.read_csv("../DATA/user_device.csv")
user_usage  = pd.read_csv("../DATA/user_usage.csv")
android_devices  = pd.read_csv("../DATA/android_devices.csv")

### Rename the `Retail Branding` column of the `android_devices` dataset to `Manufacturer`

In [2]:
android_devices.rename(columns = {'Retail Branding': 'Manufacturer'}, inplace=True)
print(android_devices)

      Manufacturer Marketing Name       Device                      Model
0              NaN            NaN       AD681H  Smartfren Andromax AD681H
1              NaN            NaN        FJL21                      FJL21
2              NaN            NaN          T31              Panasonic T31
3              NaN            NaN     hws7721g         MediaPad 7 Youth 2
4               3Q        OC1020A      OC1020A                    OC1020A
...            ...            ...          ...                        ...
14541        pendo    PNDPP44QC10  PNDPP44QC10                PNDPP44QC10
14542        pendo     PNDPP44QC7   PNDPP44QC7                 PNDPP44QC7
14543   sugar_aums         QPOINT        QPI-1                      QPI-1
14544    tecmobile       OmnisOne     OmnisOne                  Omnis One
14545        ucall          EASY1        EASY1                      EASY1

[14546 rows x 4 columns]


### Display the datasets

In [3]:
user_usage

Unnamed: 0,outgoing_mins_per_month,outgoing_sms_per_month,monthly_mb,use_id
0,21.97,4.82,1557.33,22787
1,1710.08,136.88,7267.55,22788
2,1710.08,136.88,7267.55,22789
3,94.46,35.17,519.12,22790
4,71.59,79.26,1557.33,22792
...,...,...,...,...
235,260.66,68.44,896.96,25008
236,97.12,36.50,2815.00,25040
237,355.93,12.37,6828.09,25046
238,632.06,120.46,1453.16,25058


In [4]:
user_device

Unnamed: 0,use_id,user_id,platform,platform_version,device,use_type_id
0,22782,26980,ios,10.2,"iPhone7,2",2
1,22783,29628,android,6.0,Nexus 5,3
2,22784,28473,android,5.1,SM-G903F,1
3,22785,15200,ios,10.2,"iPhone7,2",3
4,22786,28239,android,6.0,ONE E1003,1
...,...,...,...,...,...,...
267,23049,29725,android,6.0,SM-G900F,1
268,23050,29726,ios,10.2,"iPhone7,2",3
269,23051,29726,ios,10.2,"iPhone7,2",3
270,23052,29727,ios,10.1,"iPhone8,4",3


In [5]:
android_devices

Unnamed: 0,Manufacturer,Marketing Name,Device,Model
0,,,AD681H,Smartfren Andromax AD681H
1,,,FJL21,FJL21
2,,,T31,Panasonic T31
3,,,hws7721g,MediaPad 7 Youth 2
4,3Q,OC1020A,OC1020A,OC1020A
...,...,...,...,...
14541,pendo,PNDPP44QC10,PNDPP44QC10,PNDPP44QC10
14542,pendo,PNDPP44QC7,PNDPP44QC7,PNDPP44QC7
14543,sugar_aums,QPOINT,QPI-1,QPI-1
14544,tecmobile,OmnisOne,OmnisOne,Omnis One


### Your first merge

In [6]:
# Inner merge (the default): keep only the use_id values present in both tables
result_merge = pd.merge(user_usage,
                        user_device[['use_id', 'platform', 'device']],
                        on='use_id')
result_merge.head()

Unnamed: 0,outgoing_mins_per_month,outgoing_sms_per_month,monthly_mb,use_id,platform,device
0,21.97,4.82,1557.33,22787,android,GT-I9505
1,1710.08,136.88,7267.55,22788,android,SM-G930F
2,1710.08,136.88,7267.55,22789,android,SM-G930F
3,94.46,35.17,519.12,22790,android,D2303
4,71.59,79.26,1557.33,22792,android,SM-G361F


### Display the `shape` of your datasets and of the output dataset

What do you notice?

In [7]:
print("user_device dimensions {}".format(user_device.shape))
print("user_usage dimensions {}".format(user_usage.shape))
print("result dimensions {}".format(result_merge.shape))

user_device dimensions (272, 6)
user_usage dimensions (240, 4)
result dimensions (159, 6)


### Use `value_counts` to show how many `use_id` values from `user_usage` are present in `user_device` and how many are not

In [8]:
user_usage["use_id"].isin(user_device["use_id"]).value_counts()

True     159
False     81
Name: use_id, dtype: int64
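
The count of `True` values matches the 159 rows of the inner merge. The sketch below reproduces that relationship on toy data. The `usage` and `devices` frames are hypothetical, and the equality only holds when `use_id` is unique in the right-hand table; duplicated keys would multiply rows.

```python
import pandas as pd

# Hypothetical toy frames standing in for user_usage and user_device.
usage = pd.DataFrame({"use_id": [1, 2, 3, 4],
                      "monthly_mb": [500.0, 1200.0, 800.0, 300.0]})
devices = pd.DataFrame({"use_id": [2, 3, 5],
                        "platform": ["android", "ios", "android"]})

# Boolean mask: which usage rows have a matching use_id in devices?
in_both = usage["use_id"].isin(devices["use_id"])
inner = pd.merge(usage, devices, on="use_id")

print(in_both.value_counts())
# With unique keys on the right, the inner merge has one row per True.
print(len(inner) == in_both.sum())
```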

### The left merge

Display the `shape` of the `user_usage` dataset, the shape of the output dataset, and the number of missing values.

In [9]:
result = pd.merge(user_usage,
                 user_device[['use_id', 'platform', 'device']],
                 on='use_id', how='left')
print("user_usage dimensions: {}".format(user_usage.shape))
print("result dimensions: {}".format(result.shape))
print("There are {} missing values in the result.".format(result['platform'].isnull().sum()))

user_usage dimensions: (240, 4)
result dimensions: (240, 6)
There are 81 missing values in the result.
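
The 81 missing values correspond to the 81 `use_id` values found earlier that have no match in `user_device`. A minimal sketch of the same behaviour on hypothetical toy frames (`usage`, `devices`):

```python
import pandas as pd

# Hypothetical toy frames standing in for user_usage and user_device.
usage = pd.DataFrame({"use_id": [1, 2, 3, 4],
                      "monthly_mb": [500.0, 1200.0, 800.0, 300.0]})
devices = pd.DataFrame({"use_id": [2, 3, 5],
                        "platform": ["android", "ios", "android"],
                        "device": ["GT-I9505", "iPhone7,2", "D2303"]})

# how="left": keep every row of the left table, fill unmatched columns with NaN.
left = pd.merge(usage, devices, on="use_id", how="left")

missing = left["platform"].isnull().sum()
unmatched = (~usage["use_id"].isin(devices["use_id"])).sum()
print(left.shape, missing, unmatched)
```

The left merge keeps the row count of the left table, and the number of `NaN` entries in `platform` equals the number of unmatched keys.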


### Display your dataset

In [10]:
result

Unnamed: 0,outgoing_mins_per_month,outgoing_sms_per_month,monthly_mb,use_id,platform,device
0,21.97,4.82,1557.33,22787,android,GT-I9505
1,1710.08,136.88,7267.55,22788,android,SM-G930F
2,1710.08,136.88,7267.55,22789,android,SM-G930F
3,94.46,35.17,519.12,22790,android,D2303
4,71.59,79.26,1557.33,22792,android,SM-G361F
...,...,...,...,...,...,...
235,260.66,68.44,896.96,25008,,
236,97.12,36.50,2815.00,25040,,
237,355.93,12.37,6828.09,25046,,
238,632.06,120.46,1453.16,25058,,


### The right merge

Display the `shape` of the `user_device` dataset, the shape of the output dataset, and the number of missing values in the `monthly_mb` and `platform` columns.

In [19]:
result = pd.merge(user_usage, user_device[['use_id', 'platform', 'device']], on="use_id", how="right")
print("user_device dimensions: {}".format(user_device.shape))
print("result dimensions: {}".format(result.shape))
print("There are {} missing values in the 'monthly_mb' column in the result.".format(result['monthly_mb'].isnull().sum()))
print("There are {} missing values in the 'platform' column in the result.".format(result['platform'].isnull().sum()))

user_device dimensions: (272, 6)
result dimensions: (272, 6)
There are 113 missing values in the 'monthly_mb' column in the result.
There are 0 missing values in the 'platform' column in the result.
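
A right merge mirrors a left merge: every row of the right table survives, so `platform` is never missing, while `monthly_mb` is missing wherever the left table has no match. The sketch below, on hypothetical toy frames, also checks that swapping the arguments and using `how="left"` keeps the same keys.

```python
import pandas as pd

# Hypothetical toy frames standing in for user_usage and user_device.
usage = pd.DataFrame({"use_id": [1, 2, 3, 4],
                      "monthly_mb": [500.0, 1200.0, 800.0, 300.0]})
devices = pd.DataFrame({"use_id": [2, 3, 5],
                        "platform": ["android", "ios", "android"],
                        "device": ["GT-I9505", "iPhone7,2", "D2303"]})

# how="right": keep every row of the right table.
right = pd.merge(usage, devices, on="use_id", how="right")
print(right.shape)
print(right["monthly_mb"].isnull().sum(), right["platform"].isnull().sum())

# Same keys as a left merge with the two tables swapped (column order differs).
swapped = pd.merge(devices, usage, on="use_id", how="left")
print(right["use_id"].tolist() == swapped["use_id"].tolist())
```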


### The outer merge

Display the number of unique `use_id` values in the `user_device` and `user_usage` datasets and in the output dataset, as well as the number of non-missing values.

In [21]:
result = pd.merge(user_usage, user_device, on="use_id", how="outer")
result

Unnamed: 0,outgoing_mins_per_month,outgoing_sms_per_month,monthly_mb,use_id,user_id,platform,platform_version,device,use_type_id
0,21.97,4.82,1557.33,22787,12921.0,android,4.3,GT-I9505,1.0
1,1710.08,136.88,7267.55,22788,28714.0,android,6.0,SM-G930F,1.0
2,1710.08,136.88,7267.55,22789,28714.0,android,6.0,SM-G930F,1.0
3,94.46,35.17,519.12,22790,29592.0,android,5.1,D2303,1.0
4,71.59,79.26,1557.33,22792,28217.0,android,5.1,SM-G361F,1.0
...,...,...,...,...,...,...,...,...,...
348,,,,23047,29720.0,ios,10.2,"iPhone7,1",2.0
349,,,,23048,29724.0,android,6.0,ONEPLUS A3003,3.0
350,,,,23050,29726.0,ios,10.2,"iPhone7,2",3.0
351,,,,23051,29726.0,ios,10.2,"iPhone7,2",3.0
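
The cell above only displays the merged table. One possible way to obtain the counts the exercise asks for is sketched below on hypothetical toy frames; `indicator=True` is a standard `merge` option that adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical toy frames standing in for user_usage and user_device.
usage = pd.DataFrame({"use_id": [1, 2, 3, 4],
                      "monthly_mb": [500.0, 1200.0, 800.0, 300.0]})
devices = pd.DataFrame({"use_id": [2, 3, 5],
                        "platform": ["android", "ios", "android"]})

# how="outer": keep every key from both tables.
outer = pd.merge(usage, devices, on="use_id", how="outer", indicator=True)

print(usage["use_id"].nunique(), devices["use_id"].nunique(),
      outer["use_id"].nunique())
# left_only / right_only / both
print(outer["_merge"].value_counts())

# Rows with no missing value in either data column.
complete = outer[["monthly_mb", "platform"]].notnull().all(axis=1).sum()
print(complete)
```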


### Display rows `0, 1, 200, 201, 350, 351`
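
One way to pick rows by position is `DataFrame.iloc` with a list of integers. The sketch below uses a hypothetical five-row outer merge, so the positions are `[0, 1, 3, 4]`; on the tutorial's outer-merged `result` the list would presumably be `[0, 1, 200, 201, 350, 351]`.

```python
import pandas as pd

# Hypothetical toy frames; the real exercise uses the outer-merged `result`.
usage = pd.DataFrame({"use_id": [1, 2, 3, 4],
                      "monthly_mb": [500.0, 1200.0, 800.0, 300.0]})
devices = pd.DataFrame({"use_id": [2, 3, 5],
                        "platform": ["android", "ios", "android"]})
outer = pd.merge(usage, devices, on="use_id", how="outer")

# iloc selects by integer position, whatever the index labels are.
rows = outer.iloc[[0, 1, 3, 4]]
print(rows)
```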

### Add the `device` and `manufacturer` columns
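
A possible approach, sketched on hypothetical toy data: when the key has a different name in each table, `merge` accepts `left_on` and `right_on`. The sketch assumes that the `device` code in the usage data corresponds to the `Model` column of the device catalogue; check this against the real files. The manufacturer column is spelled `Manufacturer`, as in the rename performed earlier.

```python
import pandas as pd

# Hypothetical toy data; names and values are for illustration only.
result = pd.DataFrame({"use_id": [1, 2, 3],
                       "device": ["GT-I9505", "SM-G930F", "iPhone7,2"]})
catalogue = pd.DataFrame({"Manufacturer": ["Samsung", "Samsung", "Sony"],
                          "Model": ["GT-I9505", "SM-G930F", "D2303"]})

# Assumption: `device` in result matches `Model` in the catalogue.
with_maker = pd.merge(result,
                      catalogue[["Manufacturer", "Model"]],
                      left_on="device", right_on="Model",
                      how="left")
print(with_maker)
```

Devices missing from the catalogue (here the iPhone) keep their row and get `NaN` in `Manufacturer` and `Model`.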

### Display the devices whose `device` value starts with 'GT'
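
String columns expose vectorised methods through the `.str` accessor. A minimal sketch on a hypothetical toy frame; `na=False` treats missing device names as non-matching so the mask stays purely boolean.

```python
import pandas as pd

# Hypothetical toy frame; the real exercise uses the merged `result`.
result = pd.DataFrame({"use_id": [1, 2, 3, 4],
                       "device": ["GT-I9505", "SM-G930F", None, "GT-N7100"]})

# Keep the rows whose device name starts with "GT"; NaN counts as False.
gt_devices = result[result["device"].str.startswith("GT", na=False)]
print(gt_devices)
```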

### Display the results dataset

### Group your data by `manufacturer`
Count the `use_id` values and display the means of the `outgoing_mins_per_month`, `outgoing_sms_per_month` and `monthly_mb` columns.
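
One way to express this is `groupby` followed by `agg` with a dictionary that maps each column to an aggregation. The frame below is hypothetical toy data with the same column names as the exercise.

```python
import pandas as pd

# Hypothetical toy data with the column names used in the exercise.
result = pd.DataFrame({
    "use_id": [1, 2, 3, 4],
    "manufacturer": ["Samsung", "Samsung", "Sony", None],
    "outgoing_mins_per_month": [10.0, 30.0, 50.0, 70.0],
    "outgoing_sms_per_month": [1.0, 3.0, 5.0, 7.0],
    "monthly_mb": [100.0, 300.0, 500.0, 700.0],
})

# Count users and average the usage columns per manufacturer.
# Rows with a missing manufacturer are dropped by groupby by default.
summary = result.groupby("manufacturer").agg({
    "use_id": "count",
    "outgoing_mins_per_month": "mean",
    "outgoing_sms_per_month": "mean",
    "monthly_mb": "mean",
})
print(summary)
```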