<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Rishabh Shukla</title>
<description>A few reads about Machine Learning and Natural Language Processing</description>
<link>http://rishy.github.io//</link>
<atom:link href="http://rishy.github.io//feed.xml" rel="self" type="application/rss+xml" />
<item>
<title>How to train your Deep Neural Network</title>
<description><p>There are certain practices in <strong>Deep Learning</strong> that are highly recommended for efficiently training <strong>Deep Neural Networks</strong>. In this post, I will cover a few of the most commonly used ones, ranging from the importance of quality training data and the choice of hyperparameters to more general tips for faster prototyping of DNNs. Most of these practices are validated by research in academia and industry, and are presented with mathematical and experimental proofs in papers like <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">Efficient BackProp (Yann LeCun et al.)</a> and <a href="https://arxiv.org/pdf/1206.5533v2.pdf">Practical Recommendations for Deep Architectures (Yoshua Bengio)</a>.</p>
<p>As you’ll notice, I haven’t included any mathematical proofs in this post. The points suggested here should be taken as a summary of the best practices for training DNNs. For a more in-depth understanding, I highly recommend going through the above-mentioned papers and the references provided at the end.</p>
<hr />
<h3 id="training-data">Training data</h3>
<p>A lot of ML practitioners are in the habit of throwing raw training data at any <strong>Deep Neural Net (DNN)</strong>. And why not - any DNN would (presumably) still give good results, right? But it’s not completely old school to say that “given the right type of data, a fairly simple model will provide better and faster results than a complex DNN” (although this has its exceptions). So, whether you are working in <strong>Computer Vision</strong>, <strong>Natural Language Processing</strong>, <strong>Statistical Modelling</strong>, etc., try to preprocess your raw data. A few measures one can take to get better training data:</p>
<ul>
<li>Get your hands on as large a dataset as possible (DNNs are quite data-hungry: more is better)</li>
<li>Remove any training samples with corrupted data (short texts, highly distorted images, spurious output labels, features with lots of null values, etc.)</li>
<li>Data Augmentation - create new examples (in the case of images: rescale, add noise, etc.; a small sketch follows this list)</li>
</ul>
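<p>For instance, here is a minimal, illustrative sketch of noise-based augmentation with numpy - the value range and noise level below are assumptions, not recommendations:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def augment_with_noise(images, noise_std=0.05, seed=42):
    """Create noisy copies of a batch of images (values assumed in [0, 1])."""
    rng = np.random.RandomState(seed)
    noisy = images + rng.normal(0.0, noise_std, size=images.shape)
    # keep pixel values in a valid range after adding noise
    return np.clip(noisy, 0.0, 1.0)

# double the training set with noisy copies ('images' is a hypothetical array)
# augmented = np.concatenate([images, augment_with_noise(images)])
</code></pre></figure>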
<!-- ### Normalize input vectors
A Deep learning model can converge much faster, if the empirical mean of input vectors lies near `0`. [Efficient BackProp](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf) elucidated this in detail. Basically, it boils down to the average polarity(positive/negative) of the product of input vector - `x` and weight - `W`. Hence, during **backpropagation** all of these weights will either decrease or increase; consequently, loss will be optimized in a zig-zag fashion. -->
<h3 id="choose-appropriate-activation-functions">Choose appropriate activation functions</h3>
<p>One of the vital components of any Neural Net is the <a href="https://en.wikipedia.org/wiki/Activation_function">activation function</a>. <strong>Activations</strong> introduce the much-desired <strong>non-linearity</strong> into the model. For years, the <code class="highlighter-rouge">sigmoid</code> activation function has been the preferred choice. But a <code class="highlighter-rouge">sigmoid</code> function is inherently cursed by two drawbacks: 1. saturation of sigmoids at the tails (further causing the <a href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem">vanishing gradient problem</a>); 2. <code class="highlighter-rouge">sigmoids</code> are not zero-centered.</p>
<p>A better alternative is the <code class="highlighter-rouge">tanh</code> function - mathematically, <code class="highlighter-rouge">tanh</code> is just a rescaled and shifted <code class="highlighter-rouge">sigmoid</code>: <code class="highlighter-rouge">tanh(x) = 2*sigmoid(2x) - 1</code>. Although <code class="highlighter-rouge">tanh</code> can still suffer from the <strong>vanishing gradient problem</strong>, the good news is that <code class="highlighter-rouge">tanh</code> is zero-centered; hence, using <code class="highlighter-rouge">tanh</code> as the activation function results in faster convergence. I have found that using <code class="highlighter-rouge">tanh</code> as the activation generally works better than sigmoid.</p>
<p>You can further explore other alternatives like <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)"><code class="highlighter-rouge">ReLU</code></a>, <code class="highlighter-rouge">SoftSign</code>, etc., depending on the specific task; these have been shown to ameliorate some of these issues.</p>
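<p>A quick numpy sketch of the zero-centering argument (purely illustrative):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)

# sigmoid outputs lie in (0, 1), so they are never zero-centered
print(sigmoid(x).mean())    # ~0.5 for symmetric inputs
# tanh outputs lie in (-1, 1) and are centered around 0
print(np.tanh(x).mean())    # ~0.0 for symmetric inputs
# the identity relating the two
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True
</code></pre></figure>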
<h3 id="number-of-hidden-units-and-layers">Number of Hidden Units and Layers</h3>
<p>Keeping a larger number of hidden units than the optimal number is generally a safe bet, since any regularization method will take care of superfluous units, at least to some extent. On the other hand, with fewer hidden units than the optimal number, the chances of underfitting the model are higher.</p>
<p>Also, while employing <strong>unsupervised pre-trained representations</strong> (described in later sections), the optimal number of hidden units is generally kept even larger, since pre-trained representations might contain a lot of information that is irrelevant to the specific supervised task. By increasing the number of hidden units, the model will have the required flexibility to filter out the most appropriate information from these pre-trained representations.</p>
<p>Selecting the optimal number of layers is relatively straightforward. As <a href="https://www.quora.com/profile/Yoshua-Bengio">@Yoshua-Bengio</a> mentioned on Quora - “You just keep on adding layers, until the test error doesn’t improve anymore”. ;)</p>
<h3 id="weight-initialization">Weight Initialization</h3>
<p>Always initialize the weights with small <code class="highlighter-rouge">random numbers</code> to break the symmetry between different units. But how small should the weights be? What’s the recommended upper limit? Which probability distribution should be used for generating the random numbers? Furthermore, while using <code class="highlighter-rouge">sigmoid</code> activation functions, if the weights are initialized to very large numbers then the sigmoid will <strong>saturate</strong> (tail regions), resulting in <strong>dead neurons</strong>; if the weights are very small, then the gradients will also be small. Therefore, it’s preferable to choose weights in an intermediate range, such that they are distributed evenly around a mean value.</p>
<p>Thankfully, there has been a lot of research regarding the appropriate values of initial weights, which is really important for efficient convergence. To initialize weights that are evenly distributed, a <code class="highlighter-rouge">uniform distribution</code> is probably one of the best choices. Furthermore, as shown in the <a href="http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf">paper (Glorot and Bengio, 2010)</a>, units with more incoming connections (fan_in) should have relatively smaller weights.</p>
<p>Thanks to all these thorough experiments, we now have a tested formula that can be used directly for weight initialization: draw the weights from <code class="highlighter-rouge">~ Uniform(-r, r)</code>, where <code class="highlighter-rouge">r = sqrt(6/(fan_in+fan_out))</code> for <code class="highlighter-rouge">tanh</code> activations and <code class="highlighter-rouge">r = 4*sqrt(6/(fan_in+fan_out))</code> for <code class="highlighter-rouge">sigmoid</code> activations, with <code class="highlighter-rouge">fan_in</code> the size of the previous layer and <code class="highlighter-rouge">fan_out</code> the size of the next layer.</p>
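<p>As a minimal sketch, the formula translates directly into numpy; the layer sizes below are placeholders:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def glorot_uniform(fan_in, fan_out, activation='tanh', seed=1234):
    """Xavier/Glorot uniform initialization (Glorot and Bengio, 2010)."""
    rng = np.random.RandomState(seed)
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if activation == 'sigmoid':
        # sigmoid units need ~4x larger weights to avoid early saturation
        r *= 4.0
    return rng.uniform(-r, r, size=(fan_in, fan_out))

W = glorot_uniform(256, 128)  # weights for a 256 -> 128 tanh layer
</code></pre></figure>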
<h3 id="learning-rates">Learning Rates</h3>
<p>This is probably one of the most important hyperparameters, governing the learning process. Set the learning rate too small and your model might take ages to converge; make it too large and, within the first few training examples, your loss might shoot up to the sky. Generally, a learning rate of <code class="highlighter-rouge">0.01</code> is a safe bet, but this shouldn’t be taken as a stringent rule, since the optimal learning rate depends on the specific task.</p>
<p>In contrast to a fixed learning rate, gradually decreasing the learning rate after each epoch or after a few thousand examples is another option. Although this might help in faster training, it requires another manual decision about the new learning rates. Generally, the <strong>learning rate can be halved after each epoch</strong> - these kinds of strategies were quite common a few years back.</p>
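<p>A halving schedule is a one-liner; here is a minimal sketch (the initial rate is just the safe default mentioned above):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def halved_lr(initial_lr, epoch):
    """Halve the learning rate after every completed epoch."""
    return initial_lr * (0.5 ** epoch)

for epoch in range(4):
    print(epoch, halved_lr(0.01, epoch))  # 0.01, 0.005, 0.0025, 0.00125
</code></pre></figure>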
<p>Fortunately, we now have better <code class="highlighter-rouge">momentum based methods</code> to change the learning rate, based on the curvature of the error function. It might also help to set different learning rates for individual parameters in the model, since some parameters might be learning at a relatively slower or faster rate.</p>
<p>Lately, there has been a good amount of research on optimization methods, resulting in <code class="highlighter-rouge">adaptive learning rates</code>. At this moment we have numerous options, starting from the good old <code class="highlighter-rouge">Momentum Method</code> to <code class="highlighter-rouge">Adagrad</code>, <code class="highlighter-rouge">Adam</code> (personal favourite ;)), <code class="highlighter-rouge">RMSProp</code>, etc. Methods like <code class="highlighter-rouge">Adagrad</code> or <code class="highlighter-rouge">Adam</code> effectively save us from manually choosing an <code class="highlighter-rouge">initial learning rate</code>, and given the right amount of time the model will start to converge quite smoothly (of course, selecting a good initial rate will still help).</p>
<h3 id="hyperparameter-tuning-spun-grid-search---embrace-random-search">Hyperparameter Tuning: Spun Grid Search - Embrace Random Search</h3>
<p><strong>Grid Search</strong> has been prevalent in classical machine learning. But Grid Search is not at all efficient at finding optimal hyperparameters for DNNs, primarily because of the time a DNN takes to try out each hyperparameter combination. As the number of hyperparameters keeps increasing, the computation required for Grid Search grows exponentially.</p>
<p>There are two ways to go about it:</p>
<ol>
<li>Based on your prior experience, you can manually tune some common hyperparameters like the learning rate, number of layers, etc.</li>
<li>Instead of Grid Search, use <strong>Random Search/Random Sampling</strong> to choose optimal hyperparameters (see the sketch after this list). A combination of hyperparameters is generally chosen from a <strong>uniform distribution</strong> within the desired range. It is also possible to add prior knowledge to further shrink the search space (e.g., the learning rate shouldn’t be too large or too small). Random Search has been found to be far more efficient than Grid Search.</li>
</ol>
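<p>A minimal sketch of Random Search with numpy - the search ranges are assumptions for illustration, and <code class="highlighter-rouge">train</code>/<code class="highlighter-rouge">validate</code> are hypothetical stand-ins for your own training and evaluation code:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

rng = np.random.RandomState(0)

def sample_hyperparams():
    """Draw one random hyperparameter combination (ranges are illustrative)."""
    return {
        # log-uniform sampling keeps the search sensible across scales
        'learning_rate': 10 ** rng.uniform(-4, -1),
        'num_layers': rng.randint(2, 6),
        'hidden_units': int(2 ** rng.randint(6, 11)),  # 64 .. 1024
        'dropout': rng.uniform(0.2, 0.5),
    }

trials = [sample_hyperparams() for _ in range(20)]
# best = min(trials, key=lambda hp: validate(train(hp)))  # hypothetical helpers
</code></pre></figure>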
<h3 id="learning-methods">Learning Methods</h3>
<p>Good old <strong>Stochastic Gradient Descent</strong> might not be as efficient for DNNs (again, not a stringent rule); lately there has been a lot of research into more flexible optimization algorithms, e.g. <code class="highlighter-rouge">Adagrad</code>, <code class="highlighter-rouge">Adam</code>, <code class="highlighter-rouge">AdaDelta</code>, <code class="highlighter-rouge">RMSProp</code>, etc. In addition to providing <strong>adaptive learning rates</strong>, these sophisticated methods also use <strong>different rates for different model parameters</strong>, which generally results in a smoother convergence. It’s good to consider these as hyperparameters, and one should always try out a few of them on a subset of the training data.</p>
<h3 id="keep-dimensions-of-weights-in-the-exponential-power-of-2">Keep dimensions of weights in the exponential power of 2</h3>
<p>Even when dealing with <strong>state-of-the-art</strong> Deep Learning models on the latest hardware, <strong>memory management</strong> is still done at the byte level; so it’s always good to keep the sizes of your parameters at <code class="highlighter-rouge">64</code>, <code class="highlighter-rouge">128</code>, <code class="highlighter-rouge">512</code>, <code class="highlighter-rouge">1024</code> (all powers of <code class="highlighter-rouge">2</code>). This might help in sharding the matrices, weights, etc., resulting in a slight boost in learning efficiency. This becomes even more significant when dealing with <strong>GPUs</strong>.</p>
<h3 id="unsupervised-pretraining">Unsupervised Pretraining</h3>
<p>It doesn’t matter whether you are working with NLP, Computer Vision, Speech Recognition, etc.: <strong>Unsupervised Pretraining</strong> always helps the training of your supervised or other unsupervised models. <strong>Word Vectors</strong> in NLP are ubiquitous; you can use the <a href="http://image-net.org/">ImageNet</a> dataset to pretrain your model in an unsupervised manner for a 2-class supervised classification, or use audio samples from a much larger domain to inform a speaker disambiguation model.</p>
<h3 id="mini-batch-vs-stochastic-learning">Mini-Batch vs. Stochastic Learning</h3>
<p>The major objective of training a model is to learn appropriate parameters that result in an optimal mapping from inputs to outputs. These parameters are tuned with each training sample, irrespective of your decision to use <strong>batch</strong>, <strong>mini-batch</strong> or <strong>stochastic learning</strong>. When employing a stochastic learning approach, the gradients of the weights are tuned after each training sample, introducing noise into the gradients (hence the word ‘stochastic’). This has a very desirable effect: with the introduction of <strong>noise</strong> during training, the model becomes less prone to overfitting.</p>
<p>However, the stochastic learning approach might be relatively inefficient, since machines nowadays have far more computational power, and stochastic learning effectively wastes a large portion of it. If we are capable of computing <strong>Matrix-Matrix multiplications</strong>, why should we limit ourselves to iterating through multiplications of individual pairs of <strong>vectors</strong>? Therefore, for greater throughput/faster learning, it’s recommended to use mini-batches instead of stochastic learning.</p>
<p>But selecting an appropriate batch size is equally important, so that we still retain some noise (by not using a huge batch) while simultaneously using the computational power of machines more effectively. Commonly, a batch of <code class="highlighter-rouge">16</code> to <code class="highlighter-rouge">128</code> examples is a good choice (a power of <code class="highlighter-rouge">2</code>). Usually, the batch size is selected once you have already found the more important hyperparameters (by <strong>manual search</strong> or <strong>random search</strong>). Nevertheless, there are scenarios where the model receives training data as a stream (<a href="https://en.wikipedia.org/wiki/Online_machine_learning">online learning</a>); resorting to stochastic learning is then a good option.</p>
<h3 id="shuffling-training-examples">Shuffling training examples</h3>
<p>This comes from <strong>Information Theory</strong> - “Learning that an unlikely event has occurred is more informative than learning that a likely event has occurred”. Similarly, randomizing the order of training examples (across different epochs and mini-batches) results in faster convergence. A slight boost is always noticed when the model doesn’t see a lot of examples in the same order.</p>
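<p>A minimal sketch of a shuffled mini-batch generator in numpy (the batch size and seed are arbitrary):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def minibatches(X, y, batch_size=64, seed=0):
    """Yield mini-batches of numpy arrays in a freshly shuffled order."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# per epoch (X_train/y_train are hypothetical arrays):
# for X_b, y_b in minibatches(X_train, y_train, seed=epoch): ...
</code></pre></figure>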
<h3 id="dropout-for-regularization">Dropout for Regularization</h3>
<p>Considering that millions of parameters need to be learned, regularization becomes an imperative requisite to prevent <strong>overfitting</strong> in DNNs. You can keep using <strong>L1/L2</strong> regularization as well, but <strong>Dropout</strong> is preferable for checking overfitting in DNNs. Dropout is trivial to implement and generally results in faster learning. A default value of <code class="highlighter-rouge">0.5</code> is a good choice, although this depends on the specific task; if the model is less complex, a dropout of <code class="highlighter-rouge">0.2</code> might also suffice.</p>
<p>Dropout should be turned off during the test phase, and the weights should be scaled accordingly, as done in the <a href="https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf">original paper</a>. Just allow a model with Dropout regularization a little more training time, and the error will surely go down.</p>
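<p>A minimal numpy sketch of this train/test behaviour, applied to a layer’s activations (the drop probability is the default discussed above):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

rng = np.random.RandomState(1234)

def dropout_forward(h, drop=0.5, train=True):
    """Drop units while training; scale by the keep probability at test time."""
    keep = 1.0 - drop
    if train:
        mask = rng.binomial(1, keep, size=h.shape)
        return h * mask
    # test phase: nothing is dropped, so compensate by scaling
    return h * keep
</code></pre></figure>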
<h3 id="number-of-epochstraining-iterations">Number of Epochs/Training Iterations</h3>
<p>“Training a Deep Learning model for multiple epochs will result in a better model” - we have heard it a couple of times, but how do we quantify “multiple”? Turns out there is a simple strategy for this: just keep training your model for a fixed number of examples/epochs, say <code class="highlighter-rouge">20,000</code> examples or <code class="highlighter-rouge">1</code> epoch. After each such set, compare the <strong>test error</strong> with the <strong>train error</strong>; if the gap is decreasing, keep on training. In addition, after each such set, save a copy of your model parameters, so that you can choose from multiple models once training is finished. A sketch of this loop follows.</p>
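<p>A minimal sketch of that loop, assuming <code class="highlighter-rouge">train_one_epoch</code>, <code class="highlighter-rouge">evaluate</code> and <code class="highlighter-rouge">save</code> are hypothetical callables supplied by your own training code:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def train_with_early_stopping(train_one_epoch, evaluate, save, max_epochs=100):
    """Keep training while the train/test gap shrinks; checkpoint each round."""
    prev_gap = float('inf')
    for epoch in range(max_epochs):
        train_err = train_one_epoch()   # returns current train error
        test_err = evaluate()           # returns current test error
        save(epoch)                     # keep a copy of the parameters
        gap = test_err - train_err
        if gap >= prev_gap:
            break  # the generalization gap has stopped decreasing
        prev_gap = gap
    return epoch
</code></pre></figure>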
<h3 id="visualize">Visualize</h3>
<p>There are a thousand ways in which the training of a deep learning model can go wrong. I guess we have all been there: the model trains for hours or days, and only after training finishes do we realize something went wrong. To save yourself from bouts of hysteria in such situations (which might be quite justified ;)) - <strong>always visualize the training process</strong>. The most obvious step you can take is to <strong>print/save logs</strong> of <code class="highlighter-rouge">loss</code> values, <code class="highlighter-rouge">train error</code>, <code class="highlighter-rouge">test error</code>, etc.</p>
<p>In addition to this, another good practice is to use a visualization library to plot histograms of the weights after every few training examples or between epochs. This might help in keeping track of some of the common problems in Deep Learning models, like <strong>Vanishing Gradients</strong>, <strong>Exploding Gradients</strong>, etc.</p>
<h3 id="multi-core-machines-gpus">Multi-Core machines, GPUs</h3>
<p>The advent of GPUs, libraries that provide vectorized operations, and machines with more computational power are probably some of the most significant factors in the success of Deep Learning. If you think you are as patient as a stone, you might try running a DNN on your laptop (which can’t even open 10 tabs in your Chrome browser) and wait ages for your results. Or you can play it smart (and expensively :z) and get decent hardware with at least <strong>multiple CPU cores</strong> and a <strong>few hundred GPU cores</strong>. GPUs have revolutionized Deep Learning research (no wonder Nvidia’s stock is shooting up ;)), primarily because of their ability to perform matrix operations at a larger scale.</p>
<p>So, instead of taking weeks on a normal machine, these parallelization techniques will bring the training time down to days, if not hours.</p>
<h3 id="use-libraries-with-gpu-and-automatic-differentiation-support">Use libraries with GPU and Automatic Differentiation Support</h3>
<p>Thankfully, for rapid prototyping we have some really decent libraries like <a href="http://deeplearning.net/software/theano/">Theano</a>, <a href="https://www.tensorflow.org/">Tensorflow</a>, <a href="https://keras.io/">Keras</a>, etc. Almost all of these DL libraries provide <strong>support for GPU computation</strong> and <strong>Automatic Differentiation</strong>. So you don’t have to dive into core GPU programming (unless you want to - it’s definitely fun :)), nor do you have to write your own differentiation code, which might get a little taxing in really complex models (although you should be able to do that if required). Tensorflow additionally provides support for training your models on a <strong>distributed architecture</strong> (if you can afford it).</p>
<p>This is not at all an exhaustive list of practices for training a DNN. In order to include only the most common ones, I have excluded a few concepts like normalization of inputs, Batch/Layer Normalization, Gradient Checking, etc. Feel free to add anything in the comments section and I’ll be more than happy to update the post. :)</p>
<h3 id="references">References:</h3>
<ol>
<li><a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">Efficient BackProp(Yann LeCun et al.)</a></li>
<li><a href="https://arxiv.org/pdf/1206.5533v2.pdf">Practical Recommendations for Deep Architectures(Yoshua Bengio)</a></li>
<li><a href="http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf">Understanding the difficulty of training deep feedforward neural networks(Glorot and Bengio, 2010)</a></li>
<li><a href="https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf">Dropout: A Simple Way to Prevent Neural Networks from Overfitting</a></li>
<li><a href="https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b#.yd17cx8ml">Andrej Karpathy - Yes you should understand backprop(Medium)</a></li>
</ol>
</description>
<pubDate>Thu, 05 Jan 2017 14:30:05 +0530</pubDate>
<link>http://rishy.github.io//ml/2017/01/05/how-to-train-your-dnn/</link>
<guid isPermaLink="true">http://rishy.github.io//ml/2017/01/05/how-to-train-your-dnn/</guid>
</item>
<item>
<title>Dropout with Theano</title>
<description><p>Almost everyone working with Deep Learning would have heard at least a smattering about <strong>Dropout</strong>. It is a simple concept (<a href="https://arxiv.org/pdf/1207.0580v1.pdf">introduced</a> a couple of years ago) that sounds like a pretty obvious way of model averaging, resulting in a more generalized and regularized Neural Net; still, when you actually get into the nitty-gritty details of implementing it in your favourite library (Theano being mine), you might find some roadblocks there. Why? Because it’s not exactly straightforward to randomly deactivate some neurons in a DNN.</p>
<p>In this post, we’ll just recapitulate what has already been explained in detail about Dropout in a lot of papers and online resources (some of these are provided at the end of the post). Our main focus will be on implementing a Dropout layer in <a href="https://docs.scipy.org/doc/numpy-dev/user/quickstart.html">Numpy</a> and <a href="http://deeplearning.net/software/theano/introduction.html">Theano</a>, while taking care of all the related caveats. You can find the Jupyter Notebook with the Dropout class <a href="http://nbviewer.ipython.org/github/rishy/rishy.github.io/blob/master/ipy_notebooks/Dropout-Theano.ipynb">here</a>.</p>
<p>Regularization is a technique to prevent <a href="https://en.wikipedia.org/wiki/Overfitting">Overfitting</a> in a machine learning model. Considering that a DNN has a highly complex function to fit, it can easily overfit on a small or intermediate-sized dataset.</p>
<p>In very simple terms - <em>Dropout is a highly efficient regularization technique wherein, for each iteration, we randomly remove some of the neurons in a DNN</em> (along with their connections; have a look at Fig. 1). So how does this help in regularizing a DNN? Well, by randomly removing some of the cells in the computational graph (the Neural Net), we prevent some of the neurons (which are basically hidden features of the Neural Net) from overfitting on all of the training samples. So this is more like considering only a handful of features (neurons) for each training sample and producing the output based on those features only. This results in a completely different neural net (hopefully ;)) for each training sample, and eventually our output is the average of these different nets (any <code class="highlighter-rouge">Random Forests</code>-phile here? :D).</p>
<h2 id="graphical-overview">Graphical Overview:</h2>
<p>In Fig. 1, we have a fully connected deep neural net on the left side, where each neuron is connected to the neurons in the layers above and below it. On the right side, we have randomly omitted some neurons along with their connections. For every learning step, the neural net on the right will have a different representation; consequently, only the connected neurons and their weights will be learned in a particular learning step.</p>
<p style="display: flex;">
<img src="../../../../../images/nn.png" style="height: 45%; width: 45%" />
<img src="../../../../../images/dropout-nn.png" style="height: 45%; width: 45%" />
</p>
<p style="text-align: center">
Fig. 1<br />
<span style="color: #000; font-size: 1rem;">
Left: DNN without Dropout, Right: DNN with some dropped neurons
</span>
</p>
<h2 id="theano-implementation">Theano Implementation:</h2>
<p>Let’s dive straight into the code for implementing a Dropout layer. If you don’t have prior knowledge of Theano and Numpy, please go through these two awesome blogs by <a href="https://twitter.com/dennybritz">@dennybritz</a> - <a href="http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/">Implementing a Neural Network from Scratch</a> and <a href="http://www.wildml.com/2015/09/speeding-up-your-neural-network-with-theano-and-the-gpu/">Speeding up your neural network with Theano and the GPU</a>.</p>
<p>As a general recommendation, whenever we are dealing with random numbers it is advisable to set a random seed.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10</pre></td><td class="code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">theano.sandbox.rng_mrg</span> <span class="kn">import</span> <span class="n">MRG_RandomStreams</span> <span class="k">as</span> <span class="n">RandomStreams</span>
<span class="kn">import</span> <span class="nn">theano</span>
<span class="c"># Set seed for the random numbers</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1234</span><span class="p">)</span>
<span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">RandomState</span><span class="p">(</span><span class="mi">1234</span><span class="p">)</span>
<span class="c"># Generate a theano RandomStreams</span>
<span class="n">srng</span> <span class="o">=</span> <span class="n">RandomStreams</span><span class="p">(</span><span class="n">rng</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">999999</span><span class="p">))</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Let’s walk through the above code. First, we import all the necessary modules (more about <code class="highlighter-rouge">RandomStreams</code> in a moment) and set the random seed, so that the random numbers generated are consistent across different runs. We then create an object <code class="highlighter-rouge">rng</code> of <code class="highlighter-rouge">numpy.random.RandomState</code>; this exposes a number of methods for generating random numbers drawn from a variety of probability distributions.</p>
<p>Theano is designed in a functional manner; as a result, generating random numbers in Theano computation graphs is a bit trickier than in Numpy. Using random variables with Theano is equivalent to imputing random variables into the computation graph. Theano will allocate a numpy <code class="highlighter-rouge">RandomState</code> object for each such variable, and draw from it as necessary. Theano calls this sort of sequence of random numbers a <code class="highlighter-rouge">Random Stream</code>. The <code class="highlighter-rouge">MRG_RandomStreams</code> we are using is another implementation of <code class="highlighter-rouge">RandomStreams</code> in Theano, which works on GPUs as well.</p>
<p>Finally, we create an <code class="highlighter-rouge">srng</code> object, which will provide us with random streams in each run of our optimization function.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12</pre></td><td class="code"><pre><span class="k">def</span> <span class="nf">dropit</span><span class="p">(</span><span class="n">srng</span><span class="p">,</span> <span class="n">weight</span><span class="p">,</span> <span class="n">drop</span><span class="p">):</span>
<span class="c"># proportion of probability to retain</span>
<span class="n">retain_prob</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">drop</span>
<span class="c"># a masking variable</span>
<span class="n">mask</span> <span class="o">=</span> <span class="n">srng</span><span class="o">.</span><span class="n">binomial</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">retain_prob</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">weight</span><span class="o">.</span><span class="n">shape</span><span class="p">,</span>
<span class="n">dtype</span><span class="o">=</span><span class="s">'floatX'</span><span class="p">)</span>
<span class="c"># final weight with dropped neurons</span>
<span class="k">return</span> <span class="n">theano</span><span class="o">.</span><span class="n">tensor</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="n">weight</span> <span class="o">*</span> <span class="n">mask</span><span class="p">,</span>
<span class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">floatX</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Here is our main dropout function, with three arguments: <code class="highlighter-rouge">srng</code> - a RandomStreams generator; <code class="highlighter-rouge">weight</code> - any theano tensor (the weights of a Neural Net); and <code class="highlighter-rouge">drop</code> - a float value denoting the proportion of neurons to drop. Naturally, the proportion of neurons to retain will be <code class="highlighter-rouge">1 - drop</code>.</p>
<p>Inside the function, we generate a random stream from a <a href="https://en.wikipedia.org/wiki/Binomial_distribution">Binomial Distribution</a>, where <code class="highlighter-rouge">n</code> denotes the number of trials, <code class="highlighter-rouge">p</code> is the probability with which to retain each neuron and <code class="highlighter-rouge">size</code> is the shape of the output. As the final step, all we need to do is switch the values of some of the neurons to <code class="highlighter-rouge">0</code>, which can be accomplished by simply multiplying <code class="highlighter-rouge">mask</code> with the <code class="highlighter-rouge">weight</code> tensor/matrix. <code class="highlighter-rouge">theano.tensor.cast</code> further type-casts the resulting value to <code class="highlighter-rouge">theano.config.floatX</code>, which is either the default value of <code class="highlighter-rouge">floatX</code> (<code class="highlighter-rouge">float32</code> in theano) or any other value we might have set in the <code class="highlighter-rouge">.theanorc</code> configuration file.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2</pre></td><td class="code"><pre><span class="k">def</span> <span class="nf">dont_dropit</span><span class="p">(</span><span class="n">weight</span><span class="p">,</span> <span class="n">drop</span><span class="p">):</span>
<span class="k">return</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">drop</span><span class="p">)</span><span class="o">*</span><span class="n">theano</span><span class="o">.</span><span class="n">tensor</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="n">weight</span><span class="p">,</span> <span class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">floatX</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Now, one thing to keep in mind: we only want to drop neurons during the training phase, not during the validation or test phase. Also, we need to somehow compensate for the fact that we deactivated some of the neurons during training. There are two ways to achieve this:</p>
<ol>
<li>
<p><strong>Scaling the Weights</strong> (implemented at the test phase): Since our resulting Neural Net is an averaged model, it makes sense to use the averaged value of the weights during the test phase, considering that we are not deactivating any neurons there. The easiest way to do this is to scale the weights (which acts as averaging) by the retain probability at test time. This is exactly what we are doing in the above function.</p>
</li>
<li>
<p><strong>Inverted Dropout</strong> (implemented at the training phase): Scaling the weights has its caveats, since we have to tweak the weights at test time. ‘Inverted Dropout’, on the other hand, performs the scaling at training time, so we don’t have to touch the test code whenever we decide to change the Dropout layer. In this post we’ll be using the first method (scaling), although I’d recommend you play with Inverted Dropout as well - a minimal sketch follows this list, and you can follow <a href="https://github.com/cs231n/cs231n.github.io/blob/master/neural-networks-2.md#reg">this</a> up for guidance.</p>
</li>
</ol>
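<p>For reference, here is a minimal sketch of the inverted variant of the <code class="highlighter-rouge">dropit</code> function above (reusing the <code class="highlighter-rouge">srng</code> and imports from the earlier snippets). The only change is dividing by the retain probability at training time, so no test-time scaling is needed:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def dropit_inverted(srng, weight, drop):
    # proportion of probability to retain
    retain_prob = 1 - drop
    mask = srng.binomial(n=1, p=retain_prob, size=weight.shape,
                         dtype='floatX')
    # scale at *training* time, so test-phase weights can be used unchanged
    return theano.tensor.cast(weight * mask / retain_prob,
                              theano.config.floatX)
</code></pre></figure>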
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4</pre></td><td class="code"><pre><span class="k">def</span> <span class="nf">dropout_layer</span><span class="p">(</span><span class="n">weight</span><span class="p">,</span> <span class="n">drop</span><span class="p">,</span> <span class="n">train</span> <span class="o">=</span> <span class="mi">1</span><span class="p">):</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">ifelse</span><span class="o">.</span><span class="n">ifelse</span><span class="p">(</span><span class="n">theano</span><span class="o">.</span><span class="n">tensor</span><span class="o">.</span><span class="n">eq</span><span class="p">(</span><span class="n">train</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">dropit</span><span class="p">(</span><span class="n">weight</span><span class="p">,</span> <span class="n">drop</span><span class="p">),</span> <span class="n">dont_dropit</span><span class="p">(</span><span class="n">weight</span><span class="p">,</span> <span class="n">drop</span><span class="p">))</span>
<span class="k">return</span> <span class="n">result</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Our final <code class="highlighter-rouge">dropout_layer</code> function uses the <code class="highlighter-rouge">theano.ifelse</code> module to return the value of either the <code class="highlighter-rouge">dropit</code> or the <code class="highlighter-rouge">dont_dropit</code> function, conditioned on whether our <code class="highlighter-rouge">train</code> flag is on or off. So while the model is in the training phase, we use dropout on our model weights; in the test phase, we simply scale the weights to compensate for all the training steps where we omitted some random neurons.</p>
<p>Finally, here’s how you can add a Dropout layer to your DNN. I am taking the example of an RNN, similar to the one used in <a href="http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/">this</a> blog:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25</pre></td><td class="code"><pre><span class="n">x</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">ivector</span><span class="p">(</span><span class="s">'x'</span><span class="p">)</span>
<span class="n">drop_value</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">scalar</span><span class="p">(</span><span class="s">'drop_value'</span><span class="p">)</span>
<span class="n">dropout</span> <span class="o">=</span> <span class="n">Dropout</span><span class="p">()</span>
<span class="n">gru</span> <span class="o">=</span> <span class="n">GRU</span><span class="p">(</span><span class="o">...</span><span class="p">)</span> <span class="c">#An object of GRU class with required arguments</span>
<span class="n">params</span> <span class="o">=</span> <span class="n">OrderedDict</span><span class="p">(</span><span class="o">...</span><span class="p">)</span> <span class="c">#A dictionary of model parameters</span>
<span class="k">def</span> <span class="nf">forward_prop</span><span class="p">(</span><span class="n">x_t</span><span class="p">,</span> <span class="n">s_t_prev</span><span class="p">,</span> <span class="n">drop_value</span><span class="p">,</span> <span class="n">train</span><span class="p">,</span> <span class="n">E</span><span class="p">,</span> <span class="n">U</span><span class="p">,</span> <span class="n">W</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
<span class="c"># Word vector embeddings</span>
<span class="n">x_e</span> <span class="o">=</span> <span class="n">E</span><span class="p">[:,</span> <span class="n">x_t</span><span class="p">]</span>
<span class="c"># GRU Layer</span>
<span class="n">W</span> <span class="o">=</span> <span class="n">dropout</span><span class="o">.</span><span class="n">dropout_layer</span><span class="p">(</span><span class="n">W</span><span class="p">,</span> <span class="n">drop_value</span><span class="p">,</span> <span class="n">train</span><span class="p">)</span>
<span class="n">U</span> <span class="o">=</span> <span class="n">dropout</span><span class="o">.</span><span class="n">dropout_layer</span><span class="p">(</span><span class="n">U</span><span class="p">,</span> <span class="n">drop_value</span><span class="p">,</span> <span class="n">train</span><span class="p">)</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">gru</span><span class="o">.</span><span class="n">GRU_layer</span><span class="p">(</span><span class="n">x_e</span><span class="p">,</span> <span class="n">s_t_prev</span><span class="p">,</span> <span class="n">U</span><span class="p">,</span> <span class="n">W</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="k">return</span> <span class="n">s_t</span>
<span class="n">s</span><span class="p">,</span> <span class="n">updates</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">scan</span><span class="p">(</span><span class="n">forward_prop</span><span class="p">,</span>
<span class="n">sequences</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="p">],</span>
<span class="n">non_sequences</span> <span class="o">=</span> <span class="p">[</span><span class="n">drop_value</span><span class="p">,</span> <span class="n">train</span><span class="p">,</span> <span class="n">params</span><span class="p">[</span><span class="s">'E'</span><span class="p">],</span>
<span class="n">params</span><span class="p">[</span><span class="s">'U'</span><span class="p">],</span> <span class="n">params</span><span class="p">[</span><span class="s">'W'</span><span class="p">],</span> <span class="n">params</span><span class="p">[</span><span class="s">'b'</span><span class="p">]],</span>
<span class="n">outputs_info</span> <span class="o">=</span> <span class="p">[</span><span class="nb">dict</span><span class="p">(</span><span class="n">initial</span><span class="o">=</span><span class="n">T</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">hidden_dim</span><span class="p">))])</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Here we have the <code class="highlighter-rouge">forward_prop</code> function for an RNN+GRU model. Starting from the first line, we create a theano tensor variable <code class="highlighter-rouge">x</code> for the input (words), and another <code class="highlighter-rouge">drop_value</code> variable of type <code class="highlighter-rouge">theano.tensor.scalar</code>, which will take a float value denoting the proportion of neurons to be dropped.</p>
<p>Then we create an object <code class="highlighter-rouge">dropout</code> of the <code class="highlighter-rouge">Dropout</code> class we implemented in the previous sections. After this, we instantiate a <code class="highlighter-rouge">GRU</code> object (I have kept this as a generic class, since you might have a different implementation). We also have one more variable, <code class="highlighter-rouge">params</code>, which is an <code class="highlighter-rouge">OrderedDict</code> containing the model parameters.</p>
<p>Furthermore, <code class="highlighter-rouge">E</code> is our word embedding matrix, <code class="highlighter-rouge">U</code> contains the input-to-hidden-layer weights, <code class="highlighter-rouge">W</code> the hidden-to-hidden-layer weights and <code class="highlighter-rouge">b</code> the bias. Then we have our workhorse - the <code class="highlighter-rouge">forward_prop</code> function, which is called iteratively for each value in the <code class="highlighter-rouge">x</code> variable (here these values are the indexes of sequential words in the text). Now all we have to do is call the <code class="highlighter-rouge">dropout_layer</code> function from <code class="highlighter-rouge">forward_prop</code>, which will return <code class="highlighter-rouge">W</code> and <code class="highlighter-rouge">U</code> with a few dropped neurons.</p>
<p>That’s it in terms of implementing and using a dropout layer with Theano, although there are a few things mentioned in the next section which you have to keep in mind when working with <code class="highlighter-rouge">RandomStreams</code>.</p>
<h2 id="few-things-to-take-care-of">Few things to take care of:</h2>
<p><b>Wherever we use a <code class="highlighter-rouge">theano.function</code> after this, we have to explicitly pass it the <code class="highlighter-rouge">updates</code> we got from the <code class="highlighter-rouge">theano.scan</code> call in the previous section. Why?</b>
Whenever there is a call to theano’s <code class="highlighter-rouge">RandomStreams</code>, it produces some updates, and all of the theano functions following the above code should be made aware of these updates. So let’s have a look at this code:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14</pre></td><td class="code"><pre><span class="n">o</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">nnet</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">params</span><span class="p">[</span><span class="s">'V'</span><span class="p">]</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])))</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">o</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="c"># cost/loss function</span>
<span class="n">loss</span> <span class="o">=</span> <span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">nnet</span><span class="o">.</span><span class="n">categorical_crossentropy</span><span class="p">(</span><span class="n">o</span><span class="p">,</span> <span class="n">y</span><span class="p">))</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="c"># cast values in 'updates' variable to a list</span>
<span class="n">updates</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">updates</span><span class="o">.</span><span class="n">items</span><span class="p">())</span>
<span class="c"># couple of commonly used theano functions with 'updates '</span>
<span class="n">predict</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">([</span><span class="n">x</span><span class="p">],</span> <span class="n">o</span><span class="p">,</span> <span class="n">updates</span> <span class="o">=</span> <span class="n">updates</span><span class="p">)</span>
<span class="n">predict_class</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">([</span><span class="n">x</span><span class="p">],</span> <span class="n">prediction</span><span class="p">,</span> <span class="n">updates</span> <span class="o">=</span> <span class="n">updates</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">([</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">],</span> <span class="n">loss</span><span class="p">,</span> <span class="n">updates</span> <span class="o">=</span> <span class="n">updates</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>As a standard procedure, we use another model parameter <code class="highlighter-rouge">V</code> (hidden-to-output) and take a <code class="highlighter-rouge">softmax</code> over it. If you have a look at the <code class="highlighter-rouge">predict</code> and <code class="highlighter-rouge">loss</code> functions, we had to explicitly tell them about the <code class="highlighter-rouge">updates</code> that <code class="highlighter-rouge">RandomStreams</code> made during the execution of the <code class="highlighter-rouge">dropout_layer</code> function; otherwise Theano will throw an error.</p>
<p><b>What is an appropriate float value for dropout?</b>
To be on the safe side, a value of <code class="highlighter-rouge">0.5</code> (as mentioned in the original <a href="https://arxiv.org/pdf/1207.0580v1.pdf">paper</a>) is generally good enough, although you can always tweak it a bit and see what works best for your model.</p>
<h2 id="alternatives-to-dropout">Alternatives to Dropout</h2>
<p>Lately, there has been a lot of research into better regularization methods for DNNs. One of the things that I really like about Dropout is that it is conceptually very simple as well as a highly effective way to prevent overfitting. Here are a few more methods that are increasingly being used in DNNs nowadays (I am omitting the standard L1/L2 regularization here):</p>
<ol>
<li>
<p><strong>Batch Normalization:</strong>
Batch Normalization primarily tackles the problem of <em>internal covariate shift</em> by normalizing the layer inputs within each mini-batch. So, in addition to simply starting from normalized inputs, Batch Normalization keeps normalizing them throughout the whole training phase. This accelerates the optimization process and, as a side product, might also eliminate the need for Dropout. Have a look at the original <a href="https://arxiv.org/pdf/1502.03167.pdf">paper</a> for a more in-depth explanation.</p>
</li>
<li>
<p><strong>Max-Norm:</strong>
Max-Norm puts a specific upper bound on the magnitude of the weight matrices, and if the magnitude exceeds this threshold then the values of the weight matrices are clipped down. This is particularly helpful for the exploding gradient problem.</p>
</li>
<li>
<p><strong>DropConnect:</strong>
When training with Dropout, a randomly selected subset of activations are set to zero within each layer. DropConnect instead sets a randomly selected subset of weights within the network to zero. Each unit thus receives input from a random subset of units in the previous layer. We derive a bound on the generalization performance of both Dropout and DropConnect. - Abstract from the original <a href="https://cs.nyu.edu/~wanli/dropc/dropc.pdf">paper</a>.</p>
</li>
<li>
<p><strong>ZoneOut (specific to RNNs):</strong>
In each training step, ZoneOut keeps the values of some of the hidden units unchanged. So, instead of throwing away the information, it enforces that a random subset of hidden units propagate the same information to the next time step.</p>
</li>
</ol>
<p>The reason I wanted to write about this is that if you are working with a low-level library like Theano, then sometimes using modules like <code class="highlighter-rouge">RandomStreams</code> can get a bit tricky. For prototyping, and even for production purposes, you should also consider higher-level libraries like <a href="https://keras.io/">Keras</a> and <a href="https://www.tensorflow.org/">TensorFlow</a>.</p>
<p>Feel free to add any other regularization methods and feedback in the comments section.</p>
<p>Suggested Readings:</p>
<ol>
<li><a href="http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/">Implementing a Neural Network From Scratch - Wildml</a></li>
<li><a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/">Introduction to Recurrent Neural Networks - Wildml</a></li>
<li><a href="https://arxiv.org/pdf/1207.0580v1.pdf">Improving neural networks by preventing co-adaptation of feature detectors</a></li>
<li><a href="http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf">Dropout: A Simple Way to Prevent Neural Networks from Overfitting</a></li>
<li><a href="http://wiki.ubc.ca/Course:CPSC522/Regularization_for_Neural_Networks">Regularization for Neural Networks</a></li>
<li><a href="http://wikicoursenote.com/wiki/Dropout">Dropout - WikiCourse</a></li>
<li><a href="https://papers.nips.cc/paper/4124-practical-large-scale-optimization-for-max-norm-regularization.pdf">Practical large scale optimization for Max Norm Regularization</a></li>
<li><a href="https://cs.nyu.edu/~wanli/dropc/dropc.pdf">DropConnect Paper</a></li>
<li><a href="https://arxiv.org/abs/1606.01305">ZoneOut Paper</a></li>
<li><a href="https://github.com/cs231n/cs231n.github.io/blob/master/neural-networks-2.md#reg">Regularization in Neural Networks</a></li>
<li><a href="https://arxiv.org/pdf/1502.03167.pdf">Batch Normalization Paper</a></li>
</ol>
</description>
<pubDate>Wed, 12 Oct 2016 17:30:05 +0530</pubDate>
<link>http://rishy.github.io//ml/2016/10/12/dropout-with-theano/</link>
<guid isPermaLink="true">http://rishy.github.io//ml/2016/10/12/dropout-with-theano/</guid>
</item>
<item>
<title>L1 vs. L2 Loss function</title>
<description><p>Least absolute deviations (L1) and least squared errors (L2) are the two standard loss functions that decide what should be minimized while learning from a dataset.</p>
<p>The L1 loss function minimizes the <b>absolute differences</b> between the estimated values and the target values. Summing over each target value <span>\( y_i \)</span> and the corresponding estimate <span>\( h(x_i) \)</span>, where <span>\( x_i \)</span> denotes the feature set of a single sample, the sum of absolute differences for ‘n’ samples can be calculated as,</p>
<div>
$$
\begin{align*}
&amp; S = \sum_{i=1}^{n}|y_i - h(x_i)|
\end{align*}
$$
</div>
<p>On the other hand, the L2 loss function minimizes the <b>squared differences</b> between the estimated and target values.</p>
<div>
$$
\begin{align*}
&amp; S = \sum_{i=1}^{n}(y_i - h(x_i))^2
\end{align*}
$$
</div>
<p>As is apparent from the above formulae, the L2 error will be much larger in the presence of outliers than the L1 error, since the difference between an incorrectly predicted value and the original target value is already large, and squaring makes it larger still.</p>
<p>As a result, the L1 loss function is more robust and is generally not affected by outliers. The L2 loss function, on the contrary, will try to adjust the model according to these outlier values, even at the expense of other samples, and is therefore highly sensitive to outliers in the dataset.</p>
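<p>Before moving to a real dataset, a toy numpy example (with made-up numbers) makes this sensitivity concrete:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.2, 6.9, 9.1])

l1 = np.sum(np.abs(y_true - y_pred))    # 0.6
l2 = np.sum((y_true - y_pred) ** 2)     # 0.1

# Turn the last target into an outlier
y_true[-1] = 90.0
l1_out = np.sum(np.abs(y_true - y_pred))  # 81.4   -&gt; grows linearly
l2_out = np.sum((y_true - y_pred) ** 2)   # ~6544.9 -&gt; explodes
</code></pre></figure>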
<p>We’ll see how outliers can affect the performance of a regression model. We are going to use pandas, scikit-learn and numpy to work through this. I’d highly recommend having a look at the <a href="http://nbviewer.ipython.org/github/rishy/rishy.github.io/blob/master/ipy_notebooks/L1%20vs.%20L2%20Loss.ipynb">ipython notebook</a> containing the code for this post.</p>
<p>We’ll be using the Boston Housing Prices dataset and will try to predict the prices using the Gradient Boosting Regressor from scikit-learn. You can download the dataset directly from <a href="https://archive.ics.uci.edu/ml/datasets/Housing">UCI Datasets</a> or from this <a href="../../../../../ipy_notebooks/Datasets/Housing.csv">csv</a>.</p>
<p>We are going to start by reading the data from the csv file.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25</pre></td><td class="code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">sklearn.cross_validation</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">GradientBoostingRegressor</span>
<span class="kn">from</span> <span class="nn">statsmodels.tools.eval_measures</span> <span class="kn">import</span> <span class="n">rmse</span>
<span class="kn">import</span> <span class="nn">matplotlib.pylab</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="c"># Make pylab inline and set the theme to 'ggplot'</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">(</span><span class="s">'ggplot'</span><span class="p">)</span>
<span class="o">%</span><span class="n">pylab</span> <span class="n">inline</span>
<span class="c"># Read Boston Housing Data</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'Datasets/Housing.csv'</span><span class="p">)</span>
<span class="c"># Create a data frame with all the independent features</span>
<span class="n">data_indep</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'medv'</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="c"># Create a target vector(vector of dependent variable, i.e. 'medv')</span>
<span class="n">data_dep</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s">'medv'</span><span class="p">]</span>
<span class="c"># Split data into training and test sets</span>
<span class="n">train_X</span><span class="p">,</span> <span class="n">test_X</span><span class="p">,</span> <span class="n">train_y</span><span class="p">,</span> <span class="n">test_y</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span>
<span class="n">data_indep</span><span class="p">,</span> <span class="n">data_dep</span><span class="p">,</span>
<span class="n">test_size</span> <span class="o">=</span> <span class="mf">0.20</span><span class="p">,</span>
<span class="n">random_state</span> <span class="o">=</span> <span class="mi">42</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<h4 id="regression-without-any-outliers">Regression without any Outliers:</h4>
<p>At this point, our housing dataset is pretty much clean and doesn’t contain any outliers as such, so let’s fit a GB regressor with the L1 and L2 loss functions.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11</pre></td><td class="code"><pre><span class="c"># GradientBoostingRegressor with a L1(Least Absolute Deviations) loss function</span>
<span class="c"># Set a random seed so that we can reproduce the results</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">32767</span><span class="p">)</span>
<span class="n">mod</span> <span class="o">=</span> <span class="n">GradientBoostingRegressor</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'lad'</span><span class="p">)</span>
<span class="n">fit</span> <span class="o">=</span> <span class="n">mod</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_X</span><span class="p">,</span> <span class="n">train_y</span><span class="p">)</span>
<span class="n">predict</span> <span class="o">=</span> <span class="n">fit</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">test_X</span><span class="p">)</span>
<span class="c"># Root Mean Squared Error</span>
<span class="k">print</span> <span class="s">"RMSE -&gt; </span><span class="si">%</span><span class="s">f"</span> <span class="o">%</span> <span class="n">rmse</span><span class="p">(</span><span class="n">predict</span><span class="p">,</span> <span class="n">test_y</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>With an L1 loss function and no outliers, we get an RMSE of 3.440147.
Let’s see what results we get with the L2 loss function.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8</pre></td><td class="code"><pre><span class="c"># GradientBoostingRegressor with L2(Least Square errors) loss function</span>
<span class="n">mod</span> <span class="o">=</span> <span class="n">GradientBoostingRegressor</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'ls'</span><span class="p">)</span>
<span class="n">fit</span> <span class="o">=</span> <span class="n">mod</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_X</span><span class="p">,</span> <span class="n">train_y</span><span class="p">)</span>
<span class="n">predict</span> <span class="o">=</span> <span class="n">fit</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">test_X</span><span class="p">)</span>
<span class="c"># Root Mean Squared Error</span>
<span class="k">print</span> <span class="s">"RMSE -&gt; </span><span class="si">%</span><span class="s">f"</span> <span class="o">%</span> <span class="n">rmse</span><span class="p">(</span><span class="n">predict</span><span class="p">,</span> <span class="n">test_y</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>This prints out an RMSE of 2.542019.</p>
<p>As is apparent from the RMSE values of the L1 and L2 loss functions, least squares (L2)
outperforms L1 when there are no outliers in the data.</p>
<h4 id="regression-with-outliers">Regression with Outliers:</h4>
<p>After looking at the minimum and maximum values of the ‘medv’ column, we can see
that its range is [5, 50].<br />
Let’s add a few outliers to this dataset, so that we can see some significant
differences between the <b>L1</b> and <b>L2</b> loss functions.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4</pre></td><td class="code"><pre><span class="c"># Get upper and lower bounds[min, max] of all the features</span>
<span class="n">stats</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
<span class="n">extremes</span> <span class="o">=</span> <span class="n">stats</span><span class="o">.</span><span class="n">loc</span><span class="p">[[</span><span class="s">'min'</span><span class="p">,</span> <span class="s">'max'</span><span class="p">],:]</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'medv'</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">extremes</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Now, we are going to generate 5 random samples, such that their values lie in
the [min, max] range of the respective features.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17</pre></td><td class="code"><pre><span class="c"># Set a random seed</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1234</span><span class="p">)</span>
<span class="c"># Create 5 random values </span>
<span class="n">rands</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">rands</span>
<span class="c"># Get the 'min' and 'max' rows as numpy array</span>
<span class="n">min_array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">extremes</span><span class="o">.</span><span class="n">loc</span><span class="p">[[</span><span class="s">'min'</span><span class="p">],</span> <span class="p">:])</span>
<span class="n">max_array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">extremes</span><span class="o">.</span><span class="n">loc</span><span class="p">[[</span><span class="s">'max'</span><span class="p">],</span> <span class="p">:])</span>
<span class="c"># Find the difference(range) of 'max' and 'min'</span>
<span class="nb">range</span> <span class="o">=</span> <span class="n">max_array</span> <span class="o">-</span> <span class="n">min_array</span>
<span class="c"># Generate 5 samples with 'rands' value</span>
<span class="n">outliers_X</span> <span class="o">=</span> <span class="p">(</span><span class="n">rands</span> <span class="o">*</span> <span class="nb">range</span><span class="p">)</span> <span class="o">+</span> <span class="n">min_array</span>
<span class="n">outliers_X</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>array([[ 17.04578252, 19.15194504, 5.68465061, 0.19151945,
0.47807845, 4.56054001, 21.49653863, 3.23572024,
5.40494736, 287.356192 , 14.40028283, 76.27278363,
8.67066488],…,
[ 69.40067405, 77.99758081, 21.73774005, 0.77997581,
0.76406824, 7.63169374, 78.63565097, 9.70691596,
18.93944359, 595.70732345, 19.9317726 , 309.64280598,
29.99632329]])</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3</pre></td><td class="code"><pre><span class="c"># We will also create some hard coded outliers</span>
<span class="c"># for 'medv', i.e. our target</span>
<span class="n">medv_outliers</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">600</span><span class="p">,</span> <span class="mi">700</span><span class="p">,</span> <span class="mi">600</span><span class="p">])</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14</pre></td><td class="code"><pre><span class="c"># Change the type of 'chas', 'rad' and 'tax' to rounded of Integers</span>
<span class="n">outliers_X</span><span class="p">[:,</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="p">]]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">int64</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">round</span><span class="p">(</span><span class="n">outliers_X</span><span class="p">[:,</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="p">]]))</span>
<span class="c"># Finally concatenate our existing 'train_X' and</span>
<span class="c"># 'train_y' with these outliers</span>
<span class="n">train_X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">train_X</span><span class="p">,</span> <span class="n">outliers_X</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">train_y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">train_y</span><span class="p">,</span> <span class="n">medv_outliers</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="c"># Plot a histogram of 'medv' in train_y</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">13</span><span class="p">,</span><span class="mi">7</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">train_y</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="nb">range</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="mi">10</span><span class="p">,</span> <span class="mi">800</span><span class="p">))</span>
<span class="n">fig</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">'medv Count'</span><span class="p">,</span> <span class="n">fontsize</span> <span class="o">=</span> <span class="mi">20</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'medv'</span><span class="p">,</span> <span class="n">fontsize</span> <span class="o">=</span> <span class="mi">16</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'count'</span><span class="p">,</span> <span class="n">fontsize</span> <span class="o">=</span> <span class="mi">16</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p><img src="../../../../../images/l1-l2-loss.png" alt="png" /></p>
<p>You can see there are some clear outliers at 600 and 700, and even one or two ‘medv’
values of 0.<br />
Now that our outliers are in place, we will once again fit the
GradientBoostingRegressor with the L1 and L2 loss functions to contrast their
performance in the presence of outliers.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10</pre></td><td class="code"><pre><span class="c"># GradientBoostingRegressor with L1 loss function</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">9876</span><span class="p">)</span>
<span class="n">mod</span> <span class="o">=</span> <span class="n">GradientBoostingRegressor</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'lad'</span><span class="p">)</span>
<span class="n">fit</span> <span class="o">=</span> <span class="n">mod</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_X</span><span class="p">,</span> <span class="n">train_y</span><span class="p">)</span>
<span class="n">predict</span> <span class="o">=</span> <span class="n">fit</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">test_X</span><span class="p">)</span>
<span class="c"># Root Mean Squared Error</span>
<span class="k">print</span> <span class="s">"RMSE -&gt; </span><span class="si">%</span><span class="s">f"</span> <span class="o">%</span> <span class="n">rmse</span><span class="p">(</span><span class="n">predict</span><span class="p">,</span> <span class="n">test_y</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>We get an RMSE of 7.055568 with the L1 loss function when outliers are present.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8</pre></td><td class="code"><pre><span class="c"># GradientBoostingRegressor with L2 loss function</span>
<span class="n">mod</span> <span class="o">=</span> <span class="n">GradientBoostingRegressor</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'ls'</span><span class="p">)</span>
<span class="n">fit</span> <span class="o">=</span> <span class="n">mod</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_X</span><span class="p">,</span> <span class="n">train_y</span><span class="p">)</span>
<span class="n">predict</span> <span class="o">=</span> <span class="n">fit</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">test_X</span><span class="p">)</span>
<span class="c"># Root Mean Squared Error</span>
<span class="k">print</span> <span class="s">"RMSE -&gt; </span><span class="si">%</span><span class="s">f"</span> <span class="o">%</span> <span class="n">rmse</span><span class="p">(</span><span class="n">predict</span><span class="p">,</span> <span class="n">test_y</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>On the other hand, we get an RMSE of 9.806251 with the L2 loss function when outliers are present.</p>
<p>With outliers in the dataset, the L2 loss function tries to adjust the
model according to these outliers at the expense of other, good samples,
since the squared error is huge for these outliers (whenever the error is &gt; 1).
L1 (least absolute deviations), on the other hand, is quite resistant to
outliers.<br />
As a result, the L2 loss function may produce large deviations on some of the
samples, which reduces overall accuracy.</p>
<p>So, if your dataset contains outliers that you cannot remove, or that genuinely belong there, an L1 loss function is the better choice. If, on the other hand, the outliers are undesired noise and you want a stable solution, you should first try to remove them and then use an L2 loss function; otherwise the performance of a model trained with L2 may deteriorate badly because of those outliers.</p>
<p>Whenever in doubt, prefer the L2 loss function; it works well in most situations.</p>
</description>
<pubDate>Tue, 28 Jul 2015 17:30:05 +0530</pubDate>
<link>http://rishy.github.io//ml/2015/07/28/l1-vs-l2-loss/</link>
<guid isPermaLink="true">http://rishy.github.io//ml/2015/07/28/l1-vs-l2-loss/</guid>
</item>
<item>
<title>Normal/Gaussian Distributions</title>
<description><p>Normal distributions are the most common distributions in statistics, primarily because they describe a lot of natural phenomena. They are also known as ‘Gaussian distributions’, or the ‘bell curve’, because of their bell-shaped density.</p>
<p><img src="../../../../../images/normal_distributions.png" alt="bell" /></p>
<p>Heights of people, sizes of things produced by machines, errors in measurements, blood pressure, marks in an examination, wages paid by a company, the life span of a species: all of these follow a normal or nearly normal distribution.</p>
<p>I don’t intend to cover a lot of mathematical background on normal distributions, but it won’t hurt to know a few of their simple properties:</p>
<ul>
<li>The bell curve is symmetrical about the mean, which lies at the center</li>
<li>mean = median = mode</li>
<li>A normal distribution is completely determined by its mean and standard deviation</li>
</ul>
<p>We can also get a normal distribution from a lot of datasets using the <a href="http://en.wikipedia.org/wiki/Central_limit_theorem">Central Limit Theorem</a> (CLT). In layman’s terms, the CLT states that if we repeatedly draw sufficiently large samples from a population and plot the <em>means</em> of these samples, the resulting distribution will be approximately normal (which a lot of statistical and machine learning models can then exploit).</p>
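<p>A quick numpy sketch (with an arbitrary population and sample size) illustrates this: even when the population is heavily skewed, the sample means come out approximately normal:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

np.random.seed(42)
# Population is exponential: heavily skewed, far from normal
samples = np.random.exponential(scale=2.0, size=(10000, 50))

# Means of the 10,000 samples of size 50; by the CLT their
# distribution is approximately normal around the population mean
sample_means = samples.mean(axis=1)
print(sample_means.mean())  # close to 2.0
print(sample_means.std())   # close to 2.0 / sqrt(50)
</code></pre></figure>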
<p>A lot of machine learning models assume that the data fed to them follows a normal distribution. So, after you have cleaned your data, you should definitely check what distribution it follows. Some of the machine learning and statistical models that assume normally distributed input data are:</p>
<ul>
<li>Gaussian naive Bayes</li>
<li>Least-squares based (regression) models</li>
<li>LDA</li>
<li>QDA</li>
</ul>
<p>It is also quite common to transform non-normal data to normal form by applying a log, square root or similar transformation.</p>
<p>If plotting the data results in a skewed plot, then it may be a log-normal distribution (as shown in the figure below), which you can transform into normal form simply by applying a log function to all the data points.</p>
<p><img src="../../../../../images/log-normal.png" alt="log-normal" /></p>
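<p>For instance, a small numpy sketch (with synthetic data) of this transformation:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

np.random.seed(7)
# Synthetic skewed data, e.g. incomes or house prices
skewed = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)

# A log transform brings it back to a (roughly) normal shape
transformed = np.log(skewed)  # approximately Normal(0, 1)
</code></pre></figure>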
<p>Once it is transformed into a normal distribution, you are free to use the dataset with models that assume normally distributed input (as listed in the section above).</p>
<p>As a general rule, <b>always look at the probability distribution of your data</b> as a first step in data analysis.</p>
</description>
<pubDate>Tue, 21 Jul 2015 09:30:01 +0530</pubDate>
<link>http://rishy.github.io//stats/2015/07/21/normal-distributions/</link>
<guid isPermaLink="true">http://rishy.github.io//stats/2015/07/21/normal-distributions/</guid>
</item>
<item>
<title>Electricity Demand Analysis and Appliance Detection</title>
<description><p>In this post, we are going to analyze electricity consumption data from a house. We have a time-series dataset which contains the power (kW), the cost of electricity and the voltage at each time stamp. We are also provided with hourly temperature records for the same day. You can download the compressed dataset from <a href="https://github.com/rishy/electricity-demand-analysis/blob/master/data-science.gz">here</a>. I’d further recommend having a look at the corresponding <a href="https://github.com/rishy/electricity-demand-analysis/blob/master/Electricity%20Demand.ipynb">ipython notebook</a>.</p>
<p>The first part is the data analysis, where we do the basic data cleaning and analyze the power demand and the cost incurred. The second part employs a KMeans clustering approach to identify which appliance might be the major contributor to the power demand in a particular hour of the day.</p>
<p>So let’s start with the basic imports and reading the data from the given dataset.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16</pre></td><td class="code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="c"># Read the sensor dataset into pandas dataframe</span>
<span class="n">sensor_data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'merged-sensor-files.csv'</span><span class="p">,</span>
<span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s">"MTU"</span><span class="p">,</span> <span class="s">"Time"</span><span class="p">,</span> <span class="s">"Power"</span><span class="p">,</span>
<span class="s">"Cost"</span><span class="p">,</span> <span class="s">"Voltage"</span><span class="p">],</span> <span class="n">header</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>
<span class="c"># Read the weather data in pandas series object</span>
<span class="n">weather_data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_json</span><span class="p">(</span><span class="s">'weather.json'</span><span class="p">,</span> <span class="n">typ</span> <span class="o">=</span><span class="s">'series'</span><span class="p">)</span>
<span class="c"># A quick look at the datasets</span>
<span class="n">sensor_data</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<div style="max-height:1000px;max-width:1500px;overflow:auto;">
<table>
<thead>
<tr>
<th></th>
<th>MTU</th>
<th>Time</th>
<th>Power</th>
<th>Cost</th>
<th>Voltage</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>MTU1</td>
<td>05/11/2015 19:59:06</td>
<td>4.102</td>
<td>0.62</td>
<td>122.4</td>
</tr>
<tr>
<th>1</th>
<td>MTU1</td>
<td>05/11/2015 19:59:05</td>
<td>4.089</td>
<td>0.62</td>
<td>122.3</td>
</tr>
<tr>
<th>2</th>
<td>MTU1</td>
<td>05/11/2015 19:59:04</td>
<td>4.089</td>
<td>0.62</td>
<td>122.3</td>
</tr>
<tr>
<th>3</th>
<td>MTU1</td>
<td>05/11/2015 19:59:06</td>
<td>4.089</td>
<td>0.62</td>
<td>122.3</td>
</tr>
<tr>
<th>4</th>
<td>MTU1</td>
<td>05/11/2015 19:59:04</td>
<td>4.097</td>
<td>0.62</td>
<td>122.4</td>
</tr>
</tbody>
</table>
</div>
<p><br />
Let’s have a quick look at the weather dataset as well:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>weather_data
</code></pre>
</div>
<pre>
2015-05-12 00:00:00 75.4
2015-05-12 01:00:00 73.2
2015-05-12 02:00:00 72.1
2015-05-12 03:00:00 71.0
2015-05-12 04:00:00 70.7
.
.
dtype: float64
</pre>
<h2 id="task-1-data-analysis">TASK 1: Data Analysis</h2>
<h3 id="data-cleaningmunging">Data Cleaning/Munging:</h3>
<p>After having a look at <b>merged-sensor-files.csv</b>, I found that there are some inconsistent rows where the header names are repeated, and as a result ‘pandas’ converts all the columns to the ‘object’ type. This is quite a common problem that arises when merging multiple csv files into a single file.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>sensor_data.dtypes
</code></pre>
</div>
<pre>
MTU object
Time object
Power object
Cost object
Voltage object
dtype: object
</pre>
<p>Let’s find out and remove these inconsistent rows so that all the columns can be converted to appropriate data types.</p>
<p>The code below finds all the rows where the “Power” column holds the string value “ Power” (a repeated header) and gets their indexes.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3</pre></td><td class="code"><pre><span class="c"># Get the inconsistent rows indexes</span>
<span class="n">faulty_row_idx</span> <span class="o">=</span> <span class="n">sensor_data</span><span class="p">[</span><span class="n">sensor_data</span><span class="p">[</span><span class="s">"Power"</span><span class="p">]</span> <span class="o">==</span> <span class="s">" Power"</span><span class="p">]</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>
<span class="n">faulty_row_idx</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<pre>
[3784,
7582,
11385,
.
.
81617,
85327]
</pre>
<p>and now we can drop these rows from the dataframe</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5</pre></td><td class="code"><pre><span class="c"># Drop these rows from sensor_data dataframe</span>
<span class="n">sensor_data</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">faulty_row_idx</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c"># This should return an empty list now</span>
<span class="n">sensor_data</span><span class="p">[</span><span class="n">sensor_data</span><span class="p">[</span><span class="s">"Power"</span><span class="p">]</span> <span class="o">==</span> <span class="s">" Power"</span><span class="p">]</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<pre>
[]
</pre>
<p>We have cleaned up the sensor_data and now all the columns can be converted to more appropriate data types.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2