# Combining datasets from RDW

Here I attempt to combine as many datasets from RDW as possible. I'll use `pandas` to clean and merge them.

To inspect several datasets at once I wrote a function that prints the first rows of each one. First I load each dataset into a separate variable. Then I put all the individual datasets in a list called *datasets* and call the `show_datasets` function.

In [1]:
import pandas as pd
from functools import reduce


def show_datasets(dataset_list):
    # Print the first rows of every dataset in the list that was passed in.
    for dataset in dataset_list:
        print(dataset.head(), "\n")


# datasets
AREA_MAN = pd.read_csv("../datasets/RDW/GEBIEDSBEHEERDER.csv")
FARE = pd.read_csv("../datasets/RDW/TARIEFDEEL.csv")
AREA = pd.read_csv("../datasets/RDW/GEBIED.csv")
INDEX = pd.read_csv("../datasets/RDW/Index_Statisch_Dynamisch.csv")
P_OPEN = pd.read_csv("../datasets/RDW/PARKING_OPEN.csv")
P_ACCESS = pd.read_csv("../datasets/RDW/PARKING_TOEGANG.csv")
P_SPEC = pd.read_csv("../datasets/RDW/SPECIFICATIES_PARKEERGEBIED.csv")
TIMEPERIOD = pd.read_csv("../datasets/RDW/TIJDVAK.csv")
USEGOAL = pd.read_csv("../datasets/RDW/GEBRUIKSDOEL.csv")



datasets = [AREA_MAN, FARE, AREA, INDEX, P_OPEN, P_ACCESS, P_SPEC, TIMEPERIOD, USEGOAL]

show_datasets(datasets)


   AreaManagerId            AreaManagerDesc  StartDateAreaManagerId  \
0           2468  Aberdeen Asset Management                20140101   
1           1783                   Westland                20131024   
2           1708            Steenwijkerland                20131018   
3           1895                    Oldambt                20120608   
4            267                    Nijkerk                20120405   

   EndDateAreaManagerId  URL  
0              29991231  NaN  
1              29991231  NaN  
2              29991231  NaN  
3              29991231  NaN  
4              29991231  NaN   

   AreaManagerId FareCalculationCode  StartDateFarePart  \
0            512               TAR04           20150101   
1             34             BEZVGBZ           20161101   
2            150               TAR04           20121218   
3            150               TAR03           20140101   
4            150               TAR04           20150101   

   StartDurationFarePart  EndD

Next I want to see which column names occur in more than one dataset, so I can merge on them. I've made a function that returns these duplicate names.

First I create an empty list and an empty set. At first I didn't know how to find duplicates in a list, so I did a search. I found an article called [Python – How to Find Duplicates in a List](https://shoutthegeek.com/how-to-find-duplicates-in-a-list-in-python/?PageSpeed=noscript) with an example that saves the duplicates in a set. I tried saving them in a list instead, but then the name is added every time there is a match, so the same duplicate name ended up in the list 3 times. In [this article](https://www.programiz.com/python-programming/set) I saw the definition of a set:

> A set is an unordered collection of items. Every set element is unique (no duplicates) and must be immutable (cannot be changed).

So I used a set instead. 

Then I loop over each dataset and its column names, saving the names in a list. Next I loop over that list and use `count()` to check whether a name occurs more than once. If it does, the name is added to the *duplicates* set. Finally I convert the set to a list and return it.

In [2]:
def get_duplicates(dataset_list):
    lst = []
    duplicates = set()
    for dataset in dataset_list:
        col_names = dataset.columns.values
        for name in col_names:
            lst.append(name)
    for i in lst:
        if lst.count(i) > 1:
            duplicates.add(i)
    return list(duplicates)

dupl = get_duplicates(datasets)
print(dupl)
    

['EndOfPeriod', 'AreaId', 'UsageId', 'StartOfPeriod', 'FareCalculationCode', 'AreaManagerId']
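
As an aside, the same result could probably be had with `collections.Counter` from the standard library, which counts every name in one pass instead of calling `count()` for each element. This is only a sketch: the function name `get_duplicates_counter` and the two small example frames are made up for illustration and are not part of the RDW data.

```python
from collections import Counter

import pandas as pd


def get_duplicates_counter(dataset_list):
    """Return the column names that occur more than once, sorted by name."""
    # Count every column name across all datasets.
    counts = Counter(name for dataset in dataset_list for name in dataset.columns)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical example frames, not the real RDW files.
left = pd.DataFrame({"AreaManagerId": [1], "AreaId": ["a"]})
right = pd.DataFrame({"AreaManagerId": [1], "UsageId": ["b"]})

print(get_duplicates_counter([left, right]))
```

Sorting the result also gives the same order on every run, which a plain set does not guarantee.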


To merge all the datasets I searched for how to do that. I found a question on Stack Overflow called [Python: pandas merge multiple dataframes](https://stackoverflow.com/questions/44327999/python-pandas-merge-multiple-dataframes). The code there uses an anonymous function with *left* and *right* parameters holding the datasets, and `pd.merge` to merge them together.

In [3]:
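def merge_on(data_frames, col_names):
    # Try to merge each dataset with the next one in the list (wrapping around
    # at the end) on every candidate column name.
    # Note: merged is overwritten on every match, so only the result of the
    # last successful merge is returned.
    merged = []
    for i, df in enumerate(data_frames):
        r = (i + 1) % len(data_frames)  # index of the next dataset, wraps to 0
        df_set = [df, data_frames[r]]
        for key in col_names:
            try:
                merged = merge_dfs(df_set, key)
                print(f"Match: {key}")
            except KeyError:
                # At least one of the two datasets doesn't have this column.
                pass
    return merged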


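def merge_dfs(data_frames, col_name, how="outer"):
    # Merge the frames one by one, from left to right, on the given column.
    # The parameter is called how so it doesn't shadow the built-in type().
    return reduce(lambda left, right: pd.merge(left, right, on=[col_name], how=how), data_frames)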


# Wrapping around list https://stackoverflow.com/questions/22122623/wrapping-around-on-a-list-when-list-index-is-out-of-range


def wrap_index_list(list, add=1):
    result = []
    for i in range(len(list)):
        result.append(list.index(list[(i + add) % len(list)]))
    return result


all_data = merge_on(datasets, dupl)
all_data


Match: AreaManagerId
Match: AreaManagerId
Match: EndOfPeriod
Match: AreaId
Match: StartOfPeriod
Match: AreaManagerId
Match: AreaId
Match: AreaManagerId
Match: AreaManagerId
Match: AreaManagerId
Match: AreaManagerId


Unnamed: 0,AreaManagerId,UsageId,UsageIdDesc,StartDateUsageId,EndDateUsageId,SpecificationIndicator,SuperiorAreaManagerId,SuperiorUsageId,AreaManagerDesc,StartDateAreaManagerId,EndDateAreaManagerId,URL
0,2468,GARAGEP,Garageparkeren,20140101,29991231,N,2468.0,PARKEREN,Aberdeen Asset Management,20140101,29991231,
1,2468,VERGUNP,Vergunning Parkeren,20140101,29991231,J,2468.0,PARKEREN,Aberdeen Asset Management,20140101,29991231,
2,2468,PARKEREN,Parkeren,20140101,29991231,J,,,Aberdeen Asset Management,20140101,29991231,
3,2468,BETAALDP,Betaald Parkeren,20140101,29991231,N,2468.0,PARKEREN,Aberdeen Asset Management,20140101,29991231,
4,484,BETAALDP,Betaald Parkeren,20130919,29991231,N,484.0,PARKEREN,Alphen aan den Rijn,20130919,29991231,www.alphenaandenrijn.nl
...,...,...,...,...,...,...,...,...,...,...,...,...
2049,612,VERGUNP,Vergunning Parkeren,20121025,20201019,J,612.0,PARKEREN,Spijkenisse,20121025,29991231,
2050,612,TERREINP,Terrein Parkeren,20131010,20201019,N,612.0,PARKEREN,Spijkenisse,20121025,29991231,
2051,612,PARKEREN,Parkeren,20121025,20201019,J,612.0,,Spijkenisse,20121025,29991231,
2052,612,BETAALDP,Betaald Parkeren,20121025,20201019,N,612.0,PARKEREN,Spijkenisse,20121025,29991231,
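
To make clearer what `reduce` does in `merge_dfs`: it merges the first two frames, then merges that result with the third, and so on. Below is a small sketch with three made-up frames. The names `a`, `b` and `c` and their values are invented for illustration and are not taken from the RDW files.

```python
from functools import reduce

import pandas as pd

# Three hypothetical frames sharing the key column "AreaManagerId".
a = pd.DataFrame({"AreaManagerId": [1, 2], "AreaManagerDesc": ["X", "Y"]})
b = pd.DataFrame({"AreaManagerId": [1, 2], "AreaId": ["1_A", "2_B"]})
c = pd.DataFrame({"AreaManagerId": [1, 3], "UsageId": ["PARKEREN", "GARAGEP"]})

# reduce folds the list from left to right: merge(merge(a, b), c).
merged = reduce(
    lambda left, right: pd.merge(left, right, on="AreaManagerId", how="outer"),
    [a, b, c],
)

print(merged.shape)
```

With an outer merge every key from any of the frames is kept, so this gives three rows (for managers 1, 2 and 3) and four columns, with `NaN` where a frame had no row for that manager.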


I noticed that the data wasn't complete yet, so I looked for more datasets that would match and could be added. I saw that *GEBIED.csv* and *PARKING_OPEN.csv* share the *AreaManagerId* key, so I merged them. I stored the result in a variable called *AREA_MAN_ID* and merged that with the already combined data in *all_data*.

In [4]:
AREA_MAN_ID = merge_dfs([P_OPEN, AREA], "AreaManagerId", "inner")
all_data = merge_dfs([all_data, AREA_MAN_ID], "AreaManagerId", "inner")
all_data.head(10)

Unnamed: 0,AreaManagerId,UsageId_x,UsageIdDesc,StartDateUsageId,EndDateUsageId,SpecificationIndicator,SuperiorAreaManagerId,SuperiorUsageId,AreaManagerDesc,StartDateAreaManagerId,...,StartOfPeriod,PeriodName,ExitPossibleAllDay,OpenAllYear,EndOfPeriod,AreaId_y,AreaDesc,StartDateArea,EndDateArea,UsageId_y
0,484,BETAALDP,Betaald Parkeren,20130919,29991231,N,484.0,PARKEREN,Alphen aan den Rijn,20130919,...,20170211151655,,0,0.0,,484_PALN,Straatparkeren Paradijslaan (Alphen a/d Rijn),20131022,20140718,BETAALDP
1,484,BETAALDP,Betaald Parkeren,20130919,29991231,N,484.0,PARKEREN,Alphen aan den Rijn,20130919,...,20170211151655,,0,0.0,,484_CPGEM,Carpool Gemeneweg (Hazerswoude-Dorp),20150505,20161231,CARPOOL
2,484,BETAALDP,Betaald Parkeren,20130919,29991231,N,484.0,PARKEREN,Alphen aan den Rijn,20130919,...,20170211151655,,0,0.0,,484_CPNIE,Carpool Nieuwkoopseweg (Aarlanderveen),20150505,20161231,CARPOOL
3,484,BETAALDP,Betaald Parkeren,20130919,29991231,N,484.0,PARKEREN,Alphen aan den Rijn,20130919,...,20170211151655,,0,0.0,,484_KEZN,Blauwe Zone Kerk en Zanen (Alphen a/d Rijn),20131022,20140718,BLAUWEZ
4,484,BETAALDP,Betaald Parkeren,20130919,29991231,N,484.0,PARKEREN,Alphen aan den Rijn,20130919,...,20170211151655,,0,0.0,,C_VLAMING,Vergunningen Cornelis de Vlamingstraat,20160101,20160102,VERGUNP
5,484,BETAALDP,Betaald Parkeren,20130919,29991231,N,484.0,PARKEREN,Alphen aan den Rijn,20130919,...,20170211151655,,0,0.0,,AIDAPLEIN,Vergunningen Aidaplein,20160101,20160102,VERGUNP
6,484,BETAALDP,Betaald Parkeren,20130919,29991231,N,484.0,PARKEREN,Alphen aan den Rijn,20130919,...,20170211151655,,0,0.0,,CARMENPL,Vergunningen Carmenplein,20160101,20160102,VERGUNP
7,484,BETAALDP,Betaald Parkeren,20130919,29991231,N,484.0,PARKEREN,Alphen aan den Rijn,20130919,...,20170211151655,,0,0.0,,LAGE_ZIJDE,Vergunningen Lage Zijde,20160101,20160102,VERGUNP
8,484,BETAALDP,Betaald Parkeren,20130919,29991231,N,484.0,PARKEREN,Alphen aan den Rijn,20130919,...,20170211151655,,0,0.0,,H-L_ZIJDE,Vergunningen Bedrijven Hoge / Lage zijde,20160101,20160102,VERGUNP
9,484,BETAALDP,Betaald Parkeren,20130919,29991231,N,484.0,PARKEREN,Alphen aan den Rijn,20130919,...,20170211151655,,0,0.0,,HOGE_ZIJDE,Vergunningen Hoge zijde,20160101,20160102,VERGUNP
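
The number of rows can grow quickly with a merge like this, because *AreaManagerId* is not unique in either frame: every row for a manager on the left is paired with every row for that manager on the right. The sketch below shows the effect on two made-up frames, and also the `validate` argument of `pd.merge`, which raises an error when the key relationship is not what you expected. The frames and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical frames: manager 484 appears 2 times on the left, 3 times on the right.
usage = pd.DataFrame({"AreaManagerId": [484, 484], "UsageId": ["BETAALDP", "VERGUNP"]})
areas = pd.DataFrame({"AreaManagerId": [484, 484, 484], "AreaId": ["A", "B", "C"]})

# Many-to-many merge: 2 x 3 = 6 rows for the same key.
combined = pd.merge(usage, areas, on="AreaManagerId", how="inner")
print(len(combined))

# validate makes pandas check the key relationship and raise MergeError
# if it does not hold (here the left keys are not unique).
try:
    pd.merge(usage, areas, on="AreaManagerId", how="inner", validate="one_to_many")
    check_failed = False
except pd.errors.MergeError:
    check_failed = True
print(check_failed)
```

Whether the repeated rows are wanted depends on what the combined data is used for, so this is something to check rather than a fix.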


Finally I saw that there were several id columns ending in `_y` and `_x`. I changed their names to better describe the data they hold, and I also removed the columns I won't use anymore.

In [5]:
rename_col = {"UsageId_y": "UsageType_Id", "AreaId_x": "AreaZone_Id"}
remove_col = ["SpecificationIndicator", "SuperiorAreaManagerId", "StartDateAreaManagerId", "EndDateAreaManagerId", "StartOfPeriod", "EndOfPeriod", "StartDateArea", "EndDateArea", "AreaId_y", "UsageId_x", "PeriodName", "URL"]

all_data = all_data.drop(columns=remove_col)
all_data = all_data.rename(columns=rename_col)

all_data["ExitPossibleAllDay"] = all_data["ExitPossibleAllDay"].replace([1, 0],[True, False])
all_data["OpenAllYear"] = all_data["OpenAllYear"].replace([1.0, 0.0],[True, False]).fillna(False)
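
The `_x` and `_y` endings come from `pd.merge` itself: when both frames have a column with the same name that is not the merge key, pandas adds these suffixes by default. The `suffixes` argument lets you pick more descriptive ones at merge time, which could replace part of the renaming afterwards. A sketch with invented frames and suffix names:

```python
import pandas as pd

# Hypothetical frames that both have a "UsageId" column besides the key.
left = pd.DataFrame({"AreaManagerId": [484], "UsageId": ["BETAALDP"]})
right = pd.DataFrame({"AreaManagerId": [484], "UsageId": ["CARPOOL"]})

# Default behaviour: overlapping columns get "_x" (left) and "_y" (right).
default = pd.merge(left, right, on="AreaManagerId")
print(list(default.columns))

# Custom suffixes (the names are chosen for illustration only).
named = pd.merge(left, right, on="AreaManagerId", suffixes=("_manager", "_area"))
print(list(named.columns))
```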

In [6]:
all_data.tail(10)

Unnamed: 0,AreaManagerId,UsageIdDesc,StartDateUsageId,EndDateUsageId,SuperiorUsageId,AreaManagerDesc,AreaZone_Id,ExitPossibleAllDay,OpenAllYear,AreaDesc,UsageType_Id
1042978,612,Garage Parkeren,20131010,20201019,PARKEREN,Spijkenisse,612_CPG,True,False,Gildenwijk,BETAALDP
1042979,612,Garage Parkeren,20131010,20201019,PARKEREN,Spijkenisse,612_CPG,True,False,Centrum Oost,BETAALDP
1042980,612,Garage Parkeren,20131010,20201019,PARKEREN,Spijkenisse,612_CPG,True,False,P+R terrein metro-/busstation Spijkenisse Centrum,BETAALDP
1042981,612,Garage Parkeren,20131010,20201019,PARKEREN,Spijkenisse,612_CPG,True,False,Centrum Noord,BETAALDP
1042982,612,Garage Parkeren,20131010,20201019,PARKEREN,Spijkenisse,612_CPG,True,False,Garage Theater (Spijkenisse),GARAGEP
1042983,612,Garage Parkeren,20131010,20201019,PARKEREN,Spijkenisse,612_CPG,True,False,Terrein Van Houtenstraat en Vredehofstraat (ce...,TERREINP
1042984,612,Garage Parkeren,20131010,20201019,PARKEREN,Spijkenisse,612_CPG,True,False,Garage Stadhuis (Spijkenisse),GARAGEP
1042985,612,Garage Parkeren,20131010,20201019,PARKEREN,Spijkenisse,612_CPG,True,False,Garage Kolkplein (Spijkenisse),GARAGEP
1042986,612,Garage Parkeren,20131010,20201019,PARKEREN,Spijkenisse,612_CPG,True,False,Garage Boekenberg (Spijkenisse),GARAGEP
1042987,612,Garage Parkeren,20131010,20201019,PARKEREN,Spijkenisse,612_CPG,True,False,Garage City Plaza (Spijkenisse),GARAGEP


With the data complete I can export it to a CSV file. With the data in a CSV file, I can then use it and filter out the data I need.

In [7]:
all_data.to_csv("RDW_complete.csv")
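
One thing to keep in mind: `to_csv` also writes the row index by default, so reading the file back gives an extra `Unnamed: 0` column unless you pass `index=False` when writing or `index_col=0` when reading. Below is a small sketch that writes an invented frame to a temporary directory instead of the real *RDW_complete.csv*.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame standing in for all_data.
frame = pd.DataFrame({"AreaManagerId": [612], "AreaDesc": ["Gildenwijk"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")

    # Default: the index is written as an unnamed first column.
    frame.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False leaves the index out of the file.
    frame.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```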