Extending Types
===============

This tutorial builds on the :doc:`previous one <tutorial0>`. Once your base
transformation is up and running, it is time to extend the fields. Out of the
box the system delivers various extensions; the full list of
:ref:`Field types <field-types>` can be found in the reference.
Translating IDs
---------------

One of the patterns we see in the evolution towards microservices is a move
away from fully normalized database schemas. Modern systems rely more on
auto-completion, storage is cheaper, and there are plenty of other reasons to
store full strings instead of IDs. In *data-migrator* this is easily supported
for hardcoded values with a small Python function:
.. code-block:: python

    M = {
        0: "NOT SET",
        1: "Hallo",
        2: "Hello",
        3: "Bonjour"
    }

    def parse_b(v):
        return M.get(int(v), M[0])

    class Result(models.Model):
        id = models.IntField(pos=0)  # keep id
        ...
        # replace ID with value
        b = models.StringField(pos=2, parse=parse_b, default="How are you doing?")
Note that values are parsed as strings by the CSV reader. ``NULL`` is by
default translated to ``None``, which is replaced by the default value and
never reaches the parse function.
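Outside the framework, the parse function behaves like a plain mapping with a
fallback. A minimal standalone check of the ``M`` and ``parse_b`` from above:

.. code-block:: python

    # The same hardcoded map and parse function as above, runnable standalone.
    M = {
        0: "NOT SET",
        1: "Hallo",
        2: "Hello",
        3: "Bonjour"
    }

    def parse_b(v):
        # CSV readers deliver strings, so convert before the lookup;
        # unknown IDs fall back to the 0 entry.
        return M.get(int(v), M[0])

    print(parse_b("2"))   # Hello
    print(parse_b("99"))  # NOT SET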
Merging columns
---------------

Another migration pattern is to merge separate (boolean) columns back into a
single enumeration column. To support this, use a row parser instead of a
single-value parser: if no ``pos`` is given, the parser is row based instead
of being linked to a single column value:
.. code-block:: python

    def employment_type(row):
        if row[26] == "1":    # contractor
            return 'contractor'
        elif row[27] == "1":  # intern
            return 'intern'
        else:
            return 'perm'

    class Result(models.Model):
        ...
        b = models.StringField(parse=employment_type, default="perm")
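The row parser receives the whole row, so it can be exercised outside the
framework as well. A quick check with a few hypothetical 28-column rows, where
only positions 26 and 27 carry the flags:

.. code-block:: python

    # Row parser from above, checked against a few hypothetical rows.
    def employment_type(row):
        if row[26] == "1":    # contractor flag
            return 'contractor'
        elif row[27] == "1":  # intern flag
            return 'intern'
        else:
            return 'perm'

    # 28-column rows; only positions 26 and 27 matter here.
    contractor_row = [""] * 26 + ["1", "0"]
    intern_row     = [""] * 26 + ["0", "1"]
    perm_row       = [""] * 26 + ["0", "0"]

    print(employment_type(contractor_row))  # contractor
    print(employment_type(intern_row))      # intern
    print(employment_type(perm_row))        # perm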
Dynamic lookups
---------------

Sometimes one needs to look up values in the target database. Do not be shy
to generate dynamic lookups as SELECT statements that run during import into
the target database.
.. code-block:: python

    class Result(models.Model):
        recruiter_uuid = models.StringField(pos=38,
            replace=lambda x: '(SELECT uuid FROM `persons` WHERE `mail`=%s limit 0,1)' % x)
This can of course be combined with Python-based transformations to fix
deleted values:
.. code-block:: python

    def recruiter(v):
        if v is None or v in ['missing1@mail.com', 'missing2@mail.com']:
            return 'default_person@mail.com'
        else:
            return v

    class Result(models.Model):
        recruiter_uuid = models.StringField(pos=38, parse=recruiter,
            replace=lambda x: '(SELECT uuid FROM `persons` WHERE `mail`=%s limit 0,1)' % x)
The output is a value for the target database that, in turn, serves as the
input for a query on that same target database.
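To see what ends up in the generated SQL, the two steps can be chained by hand.
This is only a sketch of the order of operations; inside *data-migrator* the
field applies ``parse`` and ``replace`` for you:

.. code-block:: python

    def recruiter(v):
        # Map deleted or missing accounts to a known default person.
        if v is None or v in ['missing1@mail.com', 'missing2@mail.com']:
            return 'default_person@mail.com'
        return v

    replace = lambda x: '(SELECT uuid FROM `persons` WHERE `mail`=%s limit 0,1)' % x

    # A deleted mail address is first fixed, then wrapped in a subquery.
    print(replace(recruiter('missing1@mail.com')))
    # (SELECT uuid FROM `persons` WHERE `mail`=default_person@mail.com limit 0,1)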
Table lookups
-------------

For larger tables there is support for table-driven lookups from external CSV
files. The map (just key/value pairs) can also be generated ad hoc by other
means. *data-migrator* offers a helper function :func:`~.read_map_from_csv`
to read the CSV.
.. code-block:: python

    from data_migrator.contrib.read import read_map_from_csv

    class Result(models.Model):
        country = models.MappingField(pos=33, default='NLD',
            data_map=read_map_from_csv(f=open('data/country.csv'), delimiter=';',
                                       key='country_id', value='alpha3'))
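Conceptually, such a lookup map is just a dictionary built from two CSV
columns. The sketch below shows a plain-Python equivalent of that idea (it is
not the actual :func:`~.read_map_from_csv` implementation, and the
``data/country.csv`` content shown is hypothetical):

.. code-block:: python

    import csv
    import io

    # A plain-Python equivalent of a CSV-driven lookup map: one column
    # becomes the key, another the value.
    def read_map(f, key, value, delimiter=','):
        return {row[key]: row[value] for row in csv.DictReader(f, delimiter=delimiter)}

    # Hypothetical content standing in for data/country.csv:
    country_csv = io.StringIO("country_id;alpha3\n31;NLD\n33;FRA\n")
    data_map = read_map(country_csv, key='country_id', value='alpha3', delimiter=';')
    print(data_map['31'])  # NLD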
Combining table lookups
-----------------------

The table lookup and column reduction can also be combined. Consider a
multi-state entity with specific lookup values that need to be merged into a
single lookup value:
.. code-block:: python

    from data_migrator.contrib.read import read_map_from_csv

    LOOKUP1 = read_map_from_csv(f=open('data/state1.csv'), delimiter=';', key='id', value='name')
    LOOKUP2 = read_map_from_csv(f=open('data/state2.csv'), delimiter=';', key='id', value='name')

    def parse_lookup(row):
        return LOOKUP1.get(row[1], LOOKUP2.get(row[2], ''))

    class Result(models.Model):
        ...
        state = models.StringField(parse=parse_lookup)
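The fallback order of ``parse_lookup`` (first map wins, then the second, then
the empty string) can be verified with two stub dictionaries standing in for
the CSV-driven lookups; the data below is hypothetical:

.. code-block:: python

    # Stub maps standing in for the two CSV-driven lookups.
    LOOKUP1 = {'1': 'open'}
    LOOKUP2 = {'9': 'closed'}

    def parse_lookup(row):
        # Try the first state column; fall back to the second, then to ''.
        return LOOKUP1.get(row[1], LOOKUP2.get(row[2], ''))

    print(parse_lookup(['id', '1', '-']))  # open
    print(parse_lookup(['id', '-', '9']))  # closed
    print(parse_lookup(['id', '-', '-']))  # (empty string)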
Flatten multi values
--------------------

The most extensive flattening is of a many-to-many relation, for example the
tagging of a main entity with multiple values. This is mostly implemented in a
three-table structure, following the classic normalization approach:

* a table with the main entity (for example persons),
* a table with the attributes in a fixed id/value structure, and
* a many-to-many table linking the attributes to the main entities.

A simple approach to flatten this is to encode it as a JSON list. To transform
the data, use a four-step approach:

#. Extract the data from the old system, fully expanded.
#. Read the CSV and flatten it into a map of lists.
#. Link the values at read time, replacing the main ID with lists.
#. Emit the whole as a JSON list.
The first step relies on queries like:

.. code-block:: sql

    SELECT
        P.person_id,
        S.name AS skill
    FROM person_skill_m2m P
    INNER JOIN skills S
        ON S.id = P.skill_id;
After that, loading and emitting to JSON is simply a matter of using the
:class:`~.MappingField`:

.. code-block:: python

    from data_migrator.contrib.read import read_map_from_csv

    class Result(models.Model):
        skills = models.MappingField(pos=0, default=[], as_json=True,
            data_map=read_map_from_csv(f=open('results/skill.csv'), key='candidate_id',
                                       value='skill', as_list=True))
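Steps two through four can also be sketched in plain Python, which shows what
the combination of an ``as_list`` map and ``as_json=True`` boils down to. The
CSV content below is a hypothetical export from the SQL query above:

.. code-block:: python

    import csv
    import io
    import json

    # Hypothetical expanded export from the m2m query above:
    skill_csv = io.StringIO(
        "candidate_id,skill\n"
        "7,python\n"
        "7,sql\n"
        "8,golang\n"
    )

    # Step 2: flatten the expanded rows into a map of lists.
    skills = {}
    for row in csv.DictReader(skill_csv):
        skills.setdefault(row['candidate_id'], []).append(row['skill'])

    # Steps 3-4: replace the main ID with its list and emit it as JSON.
    print(json.dumps(skills.get('7', [])))  # ["python", "sql"]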
Now take these examples and mix your own. It is standard Python; we have no
doubt you can come up with all kinds of amazing transformations.