WEBVTT
00:00:00.001 --> 00:00:09.760
Talk Python To Me. Episode number two with guest Jesse Davis recorded Sunday, April 5th, 2015.
00:00:09.760 --> 00:00:39.700
Hello and welcome to Talk Python To Me, a weekly
00:00:39.700 --> 00:00:45.020
podcast on Python, the language, the libraries, the ecosystem, and the personalities. This is your
00:00:45.020 --> 00:00:50.460
host, Michael Kennedy. Follow me on Twitter where I'm @mkennedy and keep up with the show and listen
00:00:50.460 --> 00:00:56.880
to past episodes at talkpythontome.com. This episode, we'll be talking to Jesse Davis from
00:00:56.880 --> 00:01:02.320
MongoDB about PyMongo and of course, MongoDB. Before we get to the interview, I have a quick
00:01:02.320 --> 00:01:08.280
message to share. Since we launched a week ago, the response has been overwhelming. I've received a lot of
00:01:09.640 --> 00:01:15.100
feedback. I want to thank everyone who contacted the show. However, I could use your help to make
00:01:15.100 --> 00:01:19.980
sure the show continues to grow and thrive. If you know someone who would be interested in listening to
00:01:19.980 --> 00:01:26.340
the show, please send them a link to talkpythontome.com or share this on Twitter or Facebook. Do you know of
00:01:26.340 --> 00:01:30.920
someone who would make a great guest or have a great show topic in mind? Send me a note and I'll set it up.
00:01:30.920 --> 00:01:36.980
In other excellent news, we have a show sponsor. I want to thank Python Gear from pythongear.com for
00:01:36.980 --> 00:01:42.700
sponsoring this episode and you'll hear more about them later. If you'd like to sponsor a future episode,
00:01:42.700 --> 00:01:48.140
please contact us at talkpythontome.com slash sponsor. Now onto the show.
00:01:48.140 --> 00:01:56.780
Let me introduce Jesse. Jesse Davis is a staff engineer at MongoDB in New York City. He works
00:01:56.780 --> 00:02:04.440
on the MongoDB driver team and develops PyMongo and the Mongo C driver. He's the author of the async
00:02:04.440 --> 00:02:09.220
MongoDB driver called Motor and he contributes to Tornado and asyncio.
00:02:09.220 --> 00:02:11.560
Jesse, welcome to the show.
00:02:11.560 --> 00:02:13.140
Thanks, Michael.
00:02:13.140 --> 00:02:17.840
It's really great to have you here on the show. And, you know, we've known each other
00:02:17.840 --> 00:02:22.820
sort of as acquaintances for a couple of years. As you know, I'm a MongoDB master,
00:02:23.080 --> 00:02:31.200
which is kind of like an MVP community expert program you guys run. So yearly, we'll come up there and we'll have some really interesting conversations.
00:02:31.200 --> 00:02:40.580
And we've always enjoyed the sessions where you come down and talk to sort of the external experts about working with MongoDB from Python.
00:02:40.580 --> 00:02:58.260
Yeah, we've been doing those about once a year and we've got the next one coming up in a month. And I really look forward to those too. We get some of our best ideas. It definitely creates a year's worth of ideas, if not more, to kind of mull over and implement after each one of those sessions.
00:02:58.760 --> 00:03:10.080
Yeah, those are really fantastic meetings. I really, really enjoy them. So I've seen that you've done a ton of stuff with Python. You know, before we get into the details of MongoDB and PyMongo and all that, you know, how'd you get started?
00:03:10.760 --> 00:03:29.760
So it's a funny story, really, as they say. I began when I graduated from Oberlin College 15 years ago, I was a C++ guy, to the extent that I knew any programming language, really, particularly well at the age of 22.
00:03:29.760 --> 00:03:59.740
I thought that I was a C++ and graphics guy.
00:03:59.740 --> 00:04:00.740
flight patterns.
00:04:00.740 --> 00:04:03.640
Wow, that sounds like a really interesting thing to jump into.
00:04:03.640 --> 00:04:29.720
It was a great gig and I did really, really poorly at it. So I spent about two years there and realized that I wasn't yet a grown-up. I was really screwing up my life and I was a bad software engineer. So I quit the job and went out into the world to try to get my head straight. And I spent a summer biking through France.
00:04:29.720 --> 00:04:44.100
And then I spent a year at a Zen monastery in Southern California. And when I checked back with Austin Digital whether they wanted me to come back to work for them, they said no, because I had not proven myself there.
00:04:44.100 --> 00:05:02.920
So I came to New York to continue my Zen study with a place called the Village Zendo here and to start being a layman and a software professional again. And there were no C++ jobs in New York at the time.
00:05:02.920 --> 00:05:03.920
What kind of jobs were there?
00:05:04.920 --> 00:05:30.920
Well, this was fall of 2004. So the whole market was kind of in bad shape. There had been the NASDAQ crash in July of 2001 and then there was September 11th. And New York hadn't recovered from that yet. So all of the C++ people from all of the banks were still unemployed and I couldn't compete with them.
00:05:30.920 --> 00:05:53.820
So what I did find was an educational startup called Wireless Generation in Brooklyn. In recent years, it's become Amplify Education. And they were willing to take a shot at me, even though the job was in Python and Oracle. And I didn't know either of those.
00:05:53.820 --> 00:05:56.420
That's a long ways from graphics and C++.
00:05:56.420 --> 00:06:06.300
Yeah, it was a huge leap. And it was pretty tough, but I had good mentors and I started using Python there.
00:06:06.300 --> 00:06:18.400
That's excellent. So then you carried on digging into Python and I saw that you have a ton of open source projects on GitHub that are successful or contribute to them.
00:06:19.240 --> 00:06:22.120
And somehow you found your way over to MongoDB from there, huh?
00:06:22.120 --> 00:06:41.320
Yeah. So after a few years working for Wireless Generation, I wanted to get a little more breadth. And I also wanted to make sure that I didn't become so senior that I couldn't continue to program, which was a pressure that I experienced there to move into management.
00:06:41.440 --> 00:06:47.760
That's kind of the curse of success for some programmers is, you know, you're really good at this. Stop doing it now. Go manage people, right?
00:06:47.760 --> 00:06:59.440
Yeah, exactly. And so to jump ahead a little bit, now at MongoDB, we've figured out how to do that by creating this whole separate track of staff engineers, which is the track I'm on now.
00:06:59.800 --> 00:07:29.780
Oh, that's excellent.
00:07:29.780 --> 00:07:32.500
Data storage layer for applications like that.
00:07:32.500 --> 00:07:38.040
Even though MongoDB was brand new at the time, like I started using it at version like 0.8 or something.
00:07:38.040 --> 00:07:42.120
Yeah, is this like 2009, 2010 timeframe or something like that?
00:07:42.120 --> 00:07:43.940
Yeah, exactly.
00:07:44.500 --> 00:07:48.840
And it was such a cool product.
00:07:48.840 --> 00:08:12.020
And within the New York tech startup scene, it was such a rarity as a big infrastructure systems project in a New York City startup that when I finally got tired of freelancing and I wanted to settle down and make a substantial contribution to a single product, I called Elliot and I said, I'm ready to come in from the cold.
00:08:12.020 --> 00:08:12.720
And he said, great.
00:08:12.720 --> 00:08:13.960
That's excellent.
00:08:13.960 --> 00:08:16.740
That's Elliot Horowitz, who's the CTO of MongoDB, right?
00:08:16.740 --> 00:08:18.220
Yeah, exactly.
00:08:18.220 --> 00:08:18.620
Yeah.
00:08:18.620 --> 00:08:26.580
So I suspect most people who are listening to this show have heard of MongoDB, although maybe not everybody.
00:08:26.580 --> 00:08:29.100
And they might maybe just know it as a buzzword.
00:08:29.100 --> 00:08:32.200
Can you give us the quick elevator pitch of what MongoDB is?
00:08:32.440 --> 00:08:32.840
Sure.
00:08:32.840 --> 00:08:36.020
So it stores your data.
00:08:36.020 --> 00:08:37.060
It's a database.
00:08:37.060 --> 00:08:45.620
And it stores your data not in rows and columns, but in a non-relational document format.
00:08:45.620 --> 00:08:49.780
And the format is called BSON, which is a binary JSON format.
00:08:50.780 --> 00:08:56.040
So if you know JSON, MongoDB's data format is very familiar.
00:08:56.040 --> 00:09:00.740
It consists of objects, which have a set of key value pairs.
00:09:00.740 --> 00:09:11.440
And these documents can also contain arrays, strings, numbers, dates, and about a dozen primitive data types.
00:09:11.440 --> 00:09:19.760
MongoDB lets you index and query this kind of object-oriented data in a very rich way.
00:09:20.180 --> 00:09:36.380
So among the document databases that we compete with, we have a particular advantage when it comes to our ability to declare multiple indexes on a collection,
00:09:36.380 --> 00:09:42.720
the sophistication of our query language and our statistical aggregation capabilities,
00:09:42.920 --> 00:09:52.680
and our ability to let you do very complex update operations where you can add a member to a set within a document
00:09:52.680 --> 00:09:56.420
or do math on numbers within those documents.
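The update operations described here can be sketched as a plain Python dict. This is a minimal illustration, assuming a hypothetical `books` collection with `tags` and `copies_sold` fields; `$addToSet` and `$inc` are MongoDB update operators, and `update_one` is the PyMongo collection method from the 3.0 API.

```python
def record_sale_update(tag, copies=1):
    """Build an update document (field names are illustrative only)."""
    return {
        "$addToSet": {"tags": tag},       # add to an array only if not already present
        "$inc": {"copies_sold": copies},  # arithmetic performed by the server
    }


def apply_update(books, title, tag):
    """Apply the update to one book.

    `books` is assumed to be a PyMongo collection object.
    """
    return books.update_one({"title": title}, record_sale_update(tag))
```

Both changes go to the server in a single operation, so the application does not have to read the document, modify it, and write it back.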
00:09:56.420 --> 00:10:02.480
One of the things that I find people coming from a relational database world feel like,
00:10:02.480 --> 00:10:06.660
a lot of times they're like, well, it's really cool you can have these kind of hierarchical structures
00:10:06.660 --> 00:10:12.920
that more closely match the way your objects look in memory in your program.
00:10:12.920 --> 00:10:16.960
But you probably can't query properly deep down with this stuff.
00:10:16.960 --> 00:10:19.860
So if I've got, let's take a super simple example, like a bookstore,
00:10:19.860 --> 00:10:25.180
and the bookstore has books and the books have reviews as nested, like a nested array.
00:10:25.180 --> 00:10:29.820
Well, what if I just want to know all the books that have five-star reviews?
00:10:29.820 --> 00:10:31.240
Could I query that?
00:10:31.240 --> 00:10:32.160
Right, exactly.
00:10:32.160 --> 00:10:41.920
And we do provide that, and that distinguishes us somewhat from the much simpler sort of key value store
00:10:41.920 --> 00:10:45.940
or other simplified document database product.
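As a rough sketch of the bookstore question, assume each book document holds a `reviews` array whose entries carry a `stars` field (the field names and the `books` collection are hypothetical). The query filter is an ordinary Python dict that uses dot notation to reach into the nested array, and `find` is PyMongo's standard collection method.

```python
# Hypothetical schema: each book embeds its reviews as a nested array.
example_book = {
    "title": "Some Book",
    "reviews": [
        {"reviewer": "a", "stars": 3},
        {"reviewer": "b", "stars": 5},
    ],
}


def five_star_filter():
    """Build a filter matching books with at least one five-star review."""
    # Dot notation is applied to each element of the nested array.
    return {"reviews.stars": 5}


def find_five_star_books(books):
    """Run the query; `books` is assumed to be a PyMongo collection."""
    return books.find(five_star_filter())
```

A document matches if any element of its `reviews` array has `stars` equal to 5, so `example_book` above would be returned.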
00:10:45.940 --> 00:10:47.260
Yeah, definitely.
00:10:47.260 --> 00:10:50.360
And I think, you know, of all the NoSQL databases,
00:10:50.360 --> 00:10:55.160
MongoDB is one of the few that would reasonably be something you could consider
00:10:55.160 --> 00:11:01.740
as your standard general purpose database, not just some kind of high-scale special use case.
00:11:01.740 --> 00:11:02.980
Yeah, that's exactly right.
00:11:02.980 --> 00:11:07.540
And MongoDB is not the best answer to every single question, obviously.
00:11:07.540 --> 00:11:11.240
There is data that is naturally relational,
00:11:11.240 --> 00:11:17.060
and then there is data that should naturally be put in some other simpler,
00:11:17.060 --> 00:11:19.180
more specialized NoSQL database.
00:11:19.180 --> 00:11:24.780
But MongoDB is very much targeted to be the best answer to many questions
00:11:24.780 --> 00:11:28.760
and a pretty good answer to an even broader set of questions.
00:11:28.760 --> 00:11:35.860
So you can use it as your default database in the way that you might have in the past been used to,
00:11:35.860 --> 00:11:41.620
MySQL or Postgres being a pretty good answer to many questions and the best answer to many others.
00:11:42.580 --> 00:11:45.320
This episode is sponsored by Python Gear.
00:11:45.320 --> 00:11:47.520
We know you're a huge fan of Python,
00:11:47.520 --> 00:11:52.180
and Python Gear has an excellent way to put your enthusiasm for Python on display.
00:11:52.180 --> 00:11:57.940
Visit pythongear.com and pick up Python or Django t-shirts, stickers, and more.
00:11:57.940 --> 00:12:00.420
Hand screen-printed on American Apparel,
00:12:00.420 --> 00:12:03.460
these are shirts that are made to last and are very comfortable.
00:12:04.120 --> 00:12:08.660
What's more, a portion of all sales will benefit either the Python Software Foundation
00:12:08.660 --> 00:12:10.580
or the Django Software Foundation.
00:12:10.580 --> 00:12:13.680
Tell Python Gear thank you for sponsoring this podcast
00:12:13.680 --> 00:12:17.360
by visiting their site at pythongear.com and ordering a t-shirt.
00:12:17.360 --> 00:12:20.100
They're also helping us with a small contest.
00:12:20.100 --> 00:12:23.680
We're giving away a free t-shirt to one lucky listener.
00:12:23.680 --> 00:12:27.880
Visit talkpythontome.com, click on Friends of the Show,
00:12:27.880 --> 00:12:31.280
enter your email address, and we'll pick a winner before the next episode.
00:12:31.280 --> 00:12:32.980
Now, back to the show.
00:12:32.980 --> 00:12:38.940
So, I've got MongoDB, and by the way, in case people didn't know,
00:12:38.940 --> 00:12:42.180
it's open source, you can go to GitHub and check it out or see the progress.
00:12:42.180 --> 00:12:46.480
I've gotten it, but probably I downloaded it from mongodb.org,
00:12:46.480 --> 00:12:48.000
and it's running.
00:12:48.000 --> 00:12:49.340
Now I've got my Python app.
00:12:49.340 --> 00:12:50.220
What do I do?
00:12:50.220 --> 00:12:50.960
Right.
00:12:50.960 --> 00:12:56.780
So, MongoDB's network protocol is called the MongoDB Wire Protocol,
00:12:56.780 --> 00:13:01.060
and it's a basic TCP protocol.
00:13:01.060 --> 00:13:04.160
So, you need something that knows how to talk that protocol
00:13:04.160 --> 00:13:10.120
and knows how to convert between your Python data structures,
00:13:10.120 --> 00:13:13.020
your dicts and lists and strings and numbers,
00:13:13.020 --> 00:13:15.240
to BSON and back.
00:13:15.240 --> 00:13:16.440
So, you need a driver.
00:13:16.440 --> 00:13:21.640
And the standard driver for MongoDB is called PyMongo,
00:13:21.980 --> 00:13:26.120
and you install it from PyPI via pip install pymongo.
00:13:26.120 --> 00:13:32.120
The current version is about to be 3.0,
00:13:32.120 --> 00:13:34.280
which we'll release in just about a week,
00:13:34.280 --> 00:13:35.940
which is very exciting.
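To make the dict-to-BSON conversion concrete, here is a toy encoder that handles only strings and 32-bit integers. It is a sketch of the BSON layout, not PyMongo's implementation; in practice the `bson` package that ships with PyMongo does this for every supported type. The function names are made up for this example.

```python
import struct


def encode_cstring(name):
    """Field names are stored as UTF-8 bytes followed by a null byte."""
    return name.encode("utf-8") + b"\x00"


def encode_element(name, value):
    """Encode one key/value pair: type byte, field name, then the value."""
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        # Type 0x02: int32 byte length (including the null), then the bytes.
        return b"\x02" + encode_cstring(name) + struct.pack("<i", len(data)) + data
    if isinstance(value, int):
        # Type 0x10: little-endian 32-bit integer.
        return b"\x10" + encode_cstring(name) + struct.pack("<i", value)
    raise TypeError("unsupported type in this sketch: %r" % type(value))


def encode_document(doc):
    """A document is: int32 total size, the elements, then a null byte."""
    body = b"".join(encode_element(k, v) for k, v in doc.items()) + b"\x00"
    return struct.pack("<i", len(body) + 4) + body


print(encode_document({"hello": "world"}))
```

The length prefixes are what allow a driver to skip over fields or whole documents on the wire without parsing their contents.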
00:13:35.940 --> 00:13:37.860
Yeah, that's big news.
00:13:37.860 --> 00:13:41.700
Like, you guys have been trying to have a sort of a major unification
00:13:41.700 --> 00:13:43.760
of all the different drivers for the different languages.
00:13:44.080 --> 00:13:45.080
Is this part of that effort?
00:13:45.080 --> 00:13:46.400
Yeah, that's exactly right.
00:13:46.400 --> 00:13:51.700
So, PyMongo 3 has big behavioral and API improvements
00:13:51.700 --> 00:13:52.800
and standardizations,
00:13:52.800 --> 00:14:00.140
and that those changes are matched by the MongoDB Ruby driver 2.0,
00:14:00.140 --> 00:14:02.080
the C driver 1.2,
00:14:02.080 --> 00:14:03.600
the Node driver 2.0,
00:14:03.720 --> 00:14:04.720
and so on.
00:14:04.720 --> 00:14:08.240
And much more than ever before,
00:14:08.240 --> 00:14:11.760
we are all converging on the same set of behaviors
00:14:11.760 --> 00:14:13.040
and the same set of APIs.
00:14:13.040 --> 00:14:14.140
That's really cool.
00:14:14.140 --> 00:14:16.100
One of the real benefits of Mongo, I think,
00:14:16.100 --> 00:14:18.060
is it has great support for so many languages.
00:14:18.060 --> 00:14:20.040
So, if you choose your database,
00:14:20.040 --> 00:14:22.820
you're like, oh, wait, maybe this is better from, you know,
00:14:22.820 --> 00:14:23.980
Java for some reason.
00:14:23.980 --> 00:14:26.600
It still has a good data access story.
00:14:26.600 --> 00:14:27.580
So, that's fantastic.
00:14:27.580 --> 00:14:28.620
That's getting even better.
00:14:28.620 --> 00:14:29.620
Yeah, right.
00:14:29.620 --> 00:14:30.620
That's exactly right.
00:14:30.620 --> 00:14:35.360
So, we have drivers in 10 programming languages,
00:14:35.360 --> 00:14:41.240
and plus, even if you're using something weird like R or Haskell or Erlang,
00:14:41.240 --> 00:14:43.320
there's something out there in the community for you
00:14:43.320 --> 00:14:46.700
because writing a basic driver is actually fairly easy.
00:14:46.700 --> 00:14:50.440
We're really focused on making sure that each of these drivers
00:14:50.440 --> 00:14:54.260
feels right to experts in that language.
00:14:54.260 --> 00:14:56.420
So, PyMongo is very Pythonic,
00:14:56.420 --> 00:14:59.000
and it's written by Python experts,
00:14:59.540 --> 00:15:02.440
and its style and its documentation and so on
00:15:02.440 --> 00:15:04.280
are all very Python-y
00:15:04.280 --> 00:15:06.020
while at the same time balancing that
00:15:06.020 --> 00:15:07.320
with some degree of consistency
00:15:07.320 --> 00:15:10.560
with the nine other programming languages that we support.
00:15:10.560 --> 00:15:13.120
Yeah, there's got to be some interesting tension there.
00:15:13.120 --> 00:15:14.780
Huge, huge tension.
00:15:14.780 --> 00:15:17.260
It's the toughest problem that we face,
00:15:17.260 --> 00:15:19.340
and we are just in the last year or two
00:15:19.340 --> 00:15:22.640
really figuring out good bits to tackle that
00:15:22.640 --> 00:15:24.660
and to make those decisions correctly.
00:15:24.660 --> 00:15:25.820
Yeah, cool.
00:15:25.820 --> 00:15:28.640
So, you play a pretty big part in PyMongo, right?
00:15:28.820 --> 00:15:31.600
Yeah. Bernie Hackett,
00:15:31.600 --> 00:15:36.840
my boss in Palo Alto, is the PyMongo developer and maintainer,
00:15:36.840 --> 00:15:40.240
and I've been assisting him for the last three years
00:15:40.240 --> 00:15:41.960
as his second-in-command.
00:15:42.840 --> 00:15:48.820
And my main contributions to the driver are its concurrency design,
00:15:48.820 --> 00:15:55.880
its implementation of distributed systems type problem solving,
00:15:55.880 --> 00:15:58.980
and the connection pool.
00:15:58.980 --> 00:16:05.720
And with the 3.0 release, that's actually kind of done for the moment.
00:16:06.960 --> 00:16:11.300
And so I'm putting a lot of that work to rest now
00:16:11.300 --> 00:16:16.520
and moving on to become the primary maintainer of the C driver for MongoDB
00:16:16.520 --> 00:16:21.120
so that that part of the team can move into the kernel team
00:16:21.120 --> 00:16:22.380
and make contributions there.
00:16:22.380 --> 00:16:23.640
Oh, that's excellent.
00:16:23.640 --> 00:16:26.700
And that's even a little bit back to your roots from Austin,
00:16:26.700 --> 00:16:29.720
in some ways, I guess, right, with the C++ story.
00:16:30.740 --> 00:16:31.760
Yeah, exactly.
00:16:31.760 --> 00:16:37.740
Parts of my brain that have been idle for a decade are coming back online,
00:16:37.740 --> 00:16:39.620
and it's a really fun feeling.
00:16:39.620 --> 00:16:40.580
I know it is.
00:16:40.580 --> 00:16:41.660
Yeah, yeah.
00:16:41.660 --> 00:16:46.620
If audience members, if you've been programming Python for 10 years straight,
00:16:46.620 --> 00:16:49.840
like I have, I really can't recommend enough
00:16:49.840 --> 00:16:53.580
learning a very different programming language or reviving one.
00:16:53.680 --> 00:16:55.720
It's incredibly satisfying.
00:16:55.720 --> 00:16:59.440
Yeah, and gives you interesting problem solving skills
00:16:59.440 --> 00:17:03.240
that you don't necessarily develop if you stay in just one language.
00:17:03.240 --> 00:17:03.740
So that's great.
00:17:03.740 --> 00:17:04.380
Yeah.
00:17:04.380 --> 00:17:07.040
Now, there's a bunch of ways to talk to MongoDB,
00:17:07.040 --> 00:17:08.700
even just from Python, right?
00:17:08.700 --> 00:17:10.380
So there's PyMongo.
00:17:10.380 --> 00:17:11.420
What else is there?
00:17:11.420 --> 00:17:15.780
So PyMongo is the general purpose driver,
00:17:15.780 --> 00:17:18.280
and it's the most featureful, the most standard,
00:17:18.580 --> 00:17:19.700
the best maintained,
00:17:19.700 --> 00:17:26.800
but it's not optimized for some specialized use cases.
00:17:26.800 --> 00:17:33.800
And you can think of these as CPU-bound versus I/O-bound use cases.
00:17:33.800 --> 00:17:34.700
Right, okay.
00:17:34.700 --> 00:17:36.820
So for the I/O-bound
00:17:36.820 --> 00:17:39.860
cases where you've got a web application
00:17:39.860 --> 00:17:43.320
that has a huge number of client connections,
00:17:43.320 --> 00:17:46.780
but they're often kind of idle or sleepy connections,
00:17:46.780 --> 00:17:50.500
like if you're implementing a chat server
00:17:50.500 --> 00:17:52.400
or something with web sockets,
00:17:52.400 --> 00:17:56.880
you want to use an async framework in Python,
00:17:56.880 --> 00:17:59.320
like Tornado or Twisted
00:17:59.320 --> 00:18:03.320
or in the Python 3.4 standard library,
00:18:03.320 --> 00:18:04.820
we've got asyncio now.
00:18:04.820 --> 00:18:06.280
Yeah, that's a cool new feature.
00:18:06.280 --> 00:18:09.800
Right, so these are awesome async frameworks,
00:18:09.800 --> 00:18:11.760
and they solve that problem brilliantly,
00:18:11.760 --> 00:18:14.980
but they've got a gigantic compatibility issue.
00:18:16.020 --> 00:18:18.820
none of the existing libraries work with them.
00:18:18.820 --> 00:18:24.060
None of the existing sort of outside MongoDB libraries don't work with them,
00:18:24.060 --> 00:18:26.580
or like PyMongo itself doesn't work with them,
00:18:26.580 --> 00:18:27.140
or how do you mean?
00:18:27.140 --> 00:18:28.640
Well, I mean both of those.
00:18:28.640 --> 00:18:33.660
So if you've got basically a driver for any database
00:18:33.660 --> 00:18:36.980
that's not written specifically for one of these async frameworks,
00:18:36.980 --> 00:18:40.140
then it won't work with that async framework.
00:18:41.140 --> 00:18:44.780
So you need a specialized database driver for Tornado and MySQL.
00:18:44.780 --> 00:18:49.840
You need a specialized database driver for Tornado and Postgres.
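A small standard-library sketch of why a driver has to be written for the async framework: a blocking call inside a coroutine stalls the whole event loop, while a cooperative call lets other work proceed. Here `time.sleep` stands in for a blocking socket read and `asyncio.sleep` for a non-blocking one; neither involves a real database, and `asyncio.run` needs a newer Python than the 3.4 mentioned above.

```python
import asyncio
import time


async def blocking_query(delay):
    # Stand-in for a blocking driver call: the event loop cannot
    # run anything else until this returns.
    time.sleep(delay)


async def cooperative_query(delay):
    # Stand-in for an async-aware driver call: control goes back
    # to the event loop while waiting.
    await asyncio.sleep(delay)


async def run_two(query, delay):
    """Run two queries 'concurrently' and return the elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(query(delay), query(delay))
    return time.perf_counter() - start


def compare(delay=0.1):
    blocking = asyncio.run(run_two(blocking_query, delay))
    cooperative = asyncio.run(run_two(cooperative_query, delay))
    return blocking, cooperative
```

With the blocking stand-in the two waits run back to back, so the total is roughly twice the delay; with the cooperative one they overlap.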
00:18:49.840 --> 00:18:51.300
Right.
00:18:51.300 --> 00:18:54.380
And you need a specialized driver for Tornado and MongoDB.
00:18:55.100 --> 00:18:57.640
So I wrote that over the last few years,
00:18:57.640 --> 00:18:58.620
and it's called Motor,
00:18:58.620 --> 00:19:02.680
because it's taking the beginning of Mongo and Tornado.
00:19:02.680 --> 00:19:03.280
Excellent.
00:19:03.280 --> 00:19:04.540
Right.
00:19:04.540 --> 00:19:05.880
Plus it's a cool name,
00:19:05.880 --> 00:19:07.920
and somehow it was not yet taken on PyPI.
00:19:08.460 --> 00:19:15.920
So Motor is the now standard official async driver for MongoDB and Tornado,
00:19:15.920 --> 00:19:21.900
and over the next year I'm going to be expanding it out to cover asyncio next,
00:19:21.900 --> 00:19:23.880
and then eventually Twisted as well,
00:19:23.880 --> 00:19:26.620
so that it will just integrate with whatever you're using right now.
00:19:26.620 --> 00:19:27.120
Nice.
00:19:27.120 --> 00:19:29.140
And does that work with Python 3 and 2,
00:19:29.140 --> 00:19:30.940
or is that sort of a 2 thing for now,
00:19:30.940 --> 00:19:31.860
or what's the story there?
00:19:32.160 --> 00:19:35.480
That will work with Python 2.6 plus,
00:19:35.480 --> 00:19:41.320
so 2.6, 2.7, 3.3, 3.4, 3.5, when 3.5 is out.
00:19:41.320 --> 00:19:44.240
So how does that work with different implementations,
00:19:44.240 --> 00:19:45.400
like PyPy, for example?
00:19:45.400 --> 00:19:46.060
Sure.
00:19:46.060 --> 00:19:50.560
In the past, Motor and PyPy didn't work together very well,
00:19:50.560 --> 00:19:53.500
but that was about a year ago that I last personally tested them.
00:19:53.500 --> 00:19:56.580
It was correct, but it was slow,
00:19:56.580 --> 00:20:00.780
due to some very specific details about PyPy.
00:20:01.800 --> 00:20:03.060
In recent months,
00:20:03.060 --> 00:20:06.460
somebody that I didn't know posted benchmarks
00:20:06.460 --> 00:20:12.520
that showed that Tornado, Motor, and PyPy were actually blazingly fast,
00:20:12.520 --> 00:20:14.460
but I haven't personally reproduced that,
00:20:14.460 --> 00:20:18.820
so at the moment it's just kind of a hopeful sign
00:20:18.820 --> 00:20:20.940
rather than something that I would officially endorse.
00:20:20.940 --> 00:20:21.340
Sure.
00:20:21.340 --> 00:20:22.440
That's really good news, though.
00:20:22.440 --> 00:20:25.740
It looks like PyPy is moving on and has a lot of activity there,
00:20:25.740 --> 00:20:26.760
so that's really cool.
00:20:26.760 --> 00:20:27.520
I agree.
00:20:27.520 --> 00:20:27.980
Yeah.
00:20:27.980 --> 00:20:29.960
I've also heard of something called Monary.
00:20:29.960 --> 00:20:30.460
What's that?
00:20:30.460 --> 00:20:31.080
Right.
00:20:31.220 --> 00:20:34.160
So we've got this other branch of specialization.
00:20:34.160 --> 00:20:39.480
So the three categories that I think of are general purpose,
00:20:39.480 --> 00:20:41.320
I/O bound, and that's for Motor,
00:20:41.320 --> 00:20:45.520
and then there's CPU bound, and that's what Monary is for.
00:20:45.520 --> 00:20:50.280
Monary is a NumPy driver for MongoDB.
00:20:50.280 --> 00:20:51.560
Oh, that's interesting.
00:20:51.560 --> 00:20:52.040
Wow.
00:20:52.040 --> 00:20:52.960
Isn't it?
00:20:53.480 --> 00:20:55.480
I found out about it a few years ago.
00:20:55.480 --> 00:21:00.000
It was written by a quantitative analyst named David Beach.
00:21:00.000 --> 00:21:05.100
He needed it for something, some specific financial application that he was doing.
00:21:05.100 --> 00:21:13.900
And he noticed that if you stream BSON data through PyMongo and then into NumPy from there,
00:21:13.900 --> 00:21:15.400
it's pretty slow.
00:21:15.660 --> 00:21:18.820
And your data conversion is typically your bottleneck.
00:21:18.820 --> 00:21:20.240
MongoDB is fast.
00:21:20.240 --> 00:21:21.080
NumPy is fast.
00:21:21.080 --> 00:21:27.760
But converting each number from one data format to the next is very expensive,
00:21:27.760 --> 00:21:29.220
and it's a lot of wasted work.
00:21:29.920 --> 00:21:38.840
So he wrote a little bit of C code, which queries MongoDB using the C driver rather than PyMongo,
00:21:38.840 --> 00:21:46.760
converts the BSON data directly into NumPy arrays without passing through any Python data structure,
00:21:46.760 --> 00:21:52.000
and then hands you giant buffers of numbers that it got from MongoDB.
00:21:52.660 --> 00:21:59.420
And then you can use NumPy's incredibly fast statistical methods on that data.
00:21:59.420 --> 00:22:00.760
That's really fantastic.
00:22:00.760 --> 00:22:06.740
So if maybe you're storing a bunch of data in Mongo for big data on numerical type stuff,
00:22:06.740 --> 00:22:09.220
this would be the thing for you from Python.
00:22:09.220 --> 00:22:10.200
Exactly.
00:22:10.200 --> 00:22:14.240
And that's a pretty common use case among financial institutions.
00:22:14.240 --> 00:22:19.360
There's also a lot of universities doing big bioinformatics with MongoDB.
00:22:20.720 --> 00:22:28.100
And there's generally a lot of use within the scientific community for storing numeric data in MongoDB.
00:22:28.100 --> 00:22:37.620
Since NumPy has such a rich set of statistical routines that you can just take off the shelf,
00:22:37.620 --> 00:22:44.400
being able to go between the two in Python incredibly fast is an awesome feature.
00:22:44.400 --> 00:22:50.420
So Monary can do upwards of a million queries per second on commodity hardware.
00:22:50.420 --> 00:22:55.260
And, or query upwards of a million documents per second.
00:22:55.260 --> 00:22:56.800
That's amazing.
00:22:56.800 --> 00:23:00.520
We've been adding features to it over the years.
00:23:00.520 --> 00:23:04.800
We had a couple of interns last summer, Matt Cotter and Kyle Suarez.
00:23:04.800 --> 00:23:10.060
And now a new hire who's working for me, Anna Hurley,
00:23:10.060 --> 00:23:13.860
is adding more and more features to Monary every month.
00:23:13.960 --> 00:23:15.380
So it's becoming better documented.
00:23:15.380 --> 00:23:17.300
It's now read-write.
00:23:17.300 --> 00:23:21.040
So you can insert some NumPy arrays into MongoDB.
00:23:22.180 --> 00:23:24.900
And it's similarly optimized along that path.
00:23:24.900 --> 00:23:32.560
And we're adding SSL and authentication, which financial institutions will probably want if they're analyzing financial data.
00:23:32.560 --> 00:23:47.760
And that's kind of, that fills in the other portion of the environment where you're doing single-threaded CPU-bound calculations on numeric data within MongoDB.
00:23:48.300 --> 00:23:53.820
Yeah, that really does open up the whole science story for MongoDB a little bit more from Python anyway.
00:23:53.820 --> 00:23:54.480
That's really cool.
00:23:54.480 --> 00:24:01.720
So what surprises you most about what you see people doing with Mongo from Python or even with PyMongo specifically?
00:24:01.720 --> 00:24:08.880
What most often surprises me is there are certain mistakes that are incredibly common.
00:24:09.200 --> 00:24:11.480
And I wish we could figure out how to stamp them out.
00:24:11.480 --> 00:24:22.280
And the main mistake that I see people make is that they create a new MongoClient instance for every HTTP request.
00:24:23.200 --> 00:24:35.820
And so they pay the price of TCP setup, very often SSL and authentication setup, and then the TCP slow start algorithm.
00:24:35.820 --> 00:24:40.580
All of this incredible overhead involved in opening a socket.
00:24:40.580 --> 00:24:42.900
And then they do one query and shut it all down.
00:24:44.680 --> 00:24:49.160
Oh, and they defeat connection pooling as well.
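One way to avoid the per-request mistake is to create the client once per process and hand the same instance to every request handler. The sketch below shows only that pattern; `get_client` and `make_client` are hypothetical names, and in a real application the factory might be something like `lambda: pymongo.MongoClient("mongodb://localhost:27017")`.

```python
_client = None  # one shared client for the whole process


def get_client(make_client):
    """Return the shared client, creating it on first use only.

    `make_client` is a zero-argument factory, assumed here to
    construct a pymongo.MongoClient in a real application.
    """
    global _client
    if _client is None:
        _client = make_client()
    return _client


def handle_request(make_client):
    """A request handler reuses the shared client and its pooled sockets."""
    client = get_client(make_client)
    return client
```

Because every handler gets the same client, the TCP, SSL, and authentication setup is paid once and the connection pool can do its job.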