<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Rishabh Shukla</title>
<description>A few reads about Machine Learning and Natural Language Processing</description>
<link>http://rishy.github.io//</link>
<atom:link href="http://rishy.github.io//feed.xml" rel="self" type="application/rss+xml" />
<item>
<title>How to train your Deep Neural Network</title>
<description><p>There are certain practices in <strong>Deep Learning</strong> that are highly recommended for efficiently training <strong>Deep Neural Networks</strong>. In this post, I will cover a few of the most commonly used ones, ranging from the importance of quality training data and the choice of hyperparameters to more general tips for faster prototyping of DNNs. Most of these practices are validated by research in academia and industry, and are presented with mathematical and experimental proofs in papers like <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">Efficient BackProp (Yann LeCun et al.)</a> and <a href="https://arxiv.org/pdf/1206.5533v2.pdf">Practical Recommendations for Deep Architectures (Yoshua Bengio)</a>.</p>
<p>As you’ll notice, I haven’t included any mathematical proofs in this post. The points suggested here should be taken as a summary of the best practices for training DNNs. For a more in-depth understanding, I highly recommend going through the above-mentioned papers and the references provided at the end.</p>
<hr />
<h3 id="training-data">Training data</h3>
<p>A lot of ML practitioners are in the habit of throwing raw training data at any <strong>Deep Neural Net (DNN)</strong>. And why not - any DNN would (presumably) still give good results, right? But it’s not completely old school to say that “given the right type of data, a fairly simple model will provide better and faster results than a complex DNN” (although this has its exceptions). So, whether you are working in <strong>Computer Vision</strong>, <strong>Natural Language Processing</strong>, <strong>Statistical Modelling</strong>, etc., try to preprocess your raw data. A few measures one can take to get better training data:</p>
<ul>
<li>Get your hands on as large a dataset as possible (DNNs are quite data-hungry: more is better)</li>
<li>Remove any training samples with corrupted data (short texts, highly distorted images, spurious output labels, features with lots of null values, etc.)</li>
<li>Data Augmentation - create new examples (in the case of images: rescale, add noise, etc.; a small sketch follows this list)</li>
</ul>
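<p>For instance, here is a minimal, illustrative sketch of noise-based augmentation with numpy - the value range and noise level below are assumptions, not recommendations:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def augment_with_noise(images, noise_std=0.05, seed=42):
    """Create noisy copies of a batch of images (values assumed in [0, 1])."""
    rng = np.random.RandomState(seed)
    noisy = images + rng.normal(0.0, noise_std, size=images.shape)
    # keep pixel values in a valid range after adding noise
    return np.clip(noisy, 0.0, 1.0)

# double the training set with noisy copies ('images' is a hypothetical array)
# augmented = np.concatenate([images, augment_with_noise(images)])
</code></pre></figure>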
<!-- ### Normalize input vectors
A Deep learning model can converge much faster, if the empirical mean of input vectors lies near `0`. [Efficient BackProp](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf) elucidated this in detail. Basically, it boils down to the average polarity(positive/negative) of the product of input vector - `x` and weight - `W`. Hence, during **backpropagation** all of these weights will either decrease or increase; consequently, loss will be optimized in a zig-zag fashion. -->
<h3 id="choose-appropriate-activation-functions">Choose appropriate activation functions</h3>
<p>One of the vital components of any Neural Net is the <a href="https://en.wikipedia.org/wiki/Activation_function">activation function</a>. <strong>Activations</strong> introduce the much-desired <strong>non-linearity</strong> into the model. For years, the <code class="highlighter-rouge">sigmoid</code> activation function has been the preferred choice. But a <code class="highlighter-rouge">sigmoid</code> function is inherently cursed by two drawbacks: 1. saturation of sigmoids at the tails (further causing the <a href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem">vanishing gradient problem</a>); 2. <code class="highlighter-rouge">sigmoids</code> are not zero-centered.</p>
<p>A better alternative is the <code class="highlighter-rouge">tanh</code> function - mathematically, <code class="highlighter-rouge">tanh</code> is just a rescaled and shifted <code class="highlighter-rouge">sigmoid</code>: <code class="highlighter-rouge">tanh(x) = 2*sigmoid(2x) - 1</code>. Although <code class="highlighter-rouge">tanh</code> can still suffer from the <strong>vanishing gradient problem</strong>, the good news is that <code class="highlighter-rouge">tanh</code> is zero-centered; hence, using <code class="highlighter-rouge">tanh</code> as the activation function results in faster convergence. I have found that using <code class="highlighter-rouge">tanh</code> as the activation generally works better than sigmoid.</p>
<p>You can further explore other alternatives like <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)"><code class="highlighter-rouge">ReLU</code></a>, <code class="highlighter-rouge">SoftSign</code>, etc., depending on the specific task; these have been shown to ameliorate some of these issues.</p>
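<p>A quick numpy sketch of the zero-centering argument (purely illustrative):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)

# sigmoid outputs lie in (0, 1), so they are never zero-centered
print(sigmoid(x).mean())    # ~0.5 for symmetric inputs
# tanh outputs lie in (-1, 1) and are centered around 0
print(np.tanh(x).mean())    # ~0.0 for symmetric inputs
# the identity relating the two
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True
</code></pre></figure>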
<h3 id="number-of-hidden-units-and-layers">Number of Hidden Units and Layers</h3>
<p>Keeping a larger number of hidden units than the optimal number is generally a safe bet, since any regularization method will take care of superfluous units, at least to some extent. On the other hand, with fewer hidden units than the optimal number, the chances of underfitting the model are higher.</p>
<p>Also, while employing <strong>unsupervised pre-trained representations</strong> (described in later sections), the optimal number of hidden units is generally kept even larger, since pre-trained representations might contain a lot of information that is irrelevant to the specific supervised task. By increasing the number of hidden units, the model will have the required flexibility to filter out the most appropriate information from these pre-trained representations.</p>
<p>Selecting the optimal number of layers is relatively straightforward. As <a href="https://www.quora.com/profile/Yoshua-Bengio">@Yoshua-Bengio</a> mentioned on Quora - “You just keep on adding layers, until the test error doesn’t improve anymore”. ;)</p>
<h3 id="weight-initialization">Weight Initialization</h3>
<p>Always initialize the weights with small <code class="highlighter-rouge">random numbers</code> to break the symmetry between different units. But how small should the weights be? What’s the recommended upper limit? Which probability distribution should be used for generating the random numbers? Furthermore, while using <code class="highlighter-rouge">sigmoid</code> activation functions, if the weights are initialized to very large numbers then the sigmoid will <strong>saturate</strong> (tail regions), resulting in <strong>dead neurons</strong>; if the weights are very small, then the gradients will also be small. Therefore, it’s preferable to choose weights in an intermediate range, such that they are distributed evenly around a mean value.</p>
<p>Thankfully, there has been a lot of research regarding the appropriate values of initial weights, which is really important for efficient convergence. To initialize weights that are evenly distributed, a <code class="highlighter-rouge">uniform distribution</code> is probably one of the best choices. Furthermore, as shown in the <a href="http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf">paper (Glorot and Bengio, 2010)</a>, units with more incoming connections (fan_in) should have relatively smaller weights.</p>
<p>Thanks to all these thorough experiments, we now have a tested formula that can be used directly for weight initialization: draw the weights from <code class="highlighter-rouge">~ Uniform(-r, r)</code>, where <code class="highlighter-rouge">r = sqrt(6/(fan_in+fan_out))</code> for <code class="highlighter-rouge">tanh</code> activations and <code class="highlighter-rouge">r = 4*sqrt(6/(fan_in+fan_out))</code> for <code class="highlighter-rouge">sigmoid</code> activations, with <code class="highlighter-rouge">fan_in</code> the size of the previous layer and <code class="highlighter-rouge">fan_out</code> the size of the next layer.</p>
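<p>As a minimal sketch, the formula translates directly into numpy; the layer sizes below are placeholders:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def glorot_uniform(fan_in, fan_out, activation='tanh', seed=1234):
    """Xavier/Glorot uniform initialization (Glorot and Bengio, 2010)."""
    rng = np.random.RandomState(seed)
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if activation == 'sigmoid':
        # sigmoid units need ~4x larger weights to avoid early saturation
        r *= 4.0
    return rng.uniform(-r, r, size=(fan_in, fan_out))

W = glorot_uniform(256, 128)  # weights for a 256 -> 128 tanh layer
</code></pre></figure>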
<h3 id="learning-rates">Learning Rates</h3>
<p>This is probably one of the most important hyperparameters, governing the learning process. Set the learning rate too small and your model might take ages to converge; make it too large and, within the first few training examples, your loss might shoot up to the sky. Generally, a learning rate of <code class="highlighter-rouge">0.01</code> is a safe bet, but this shouldn’t be taken as a stringent rule, since the optimal learning rate depends on the specific task.</p>
<p>In contrast to a fixed learning rate, gradually decreasing the learning rate after each epoch or after a few thousand examples is another option. Although this might help in faster training, it requires another manual decision about the new learning rates. Generally, the <strong>learning rate can be halved after each epoch</strong> - these kinds of strategies were quite common a few years back.</p>
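<p>A halving schedule is a one-liner; here is a minimal sketch (the initial rate is just the safe default mentioned above):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def halved_lr(initial_lr, epoch):
    """Halve the learning rate after every completed epoch."""
    return initial_lr * (0.5 ** epoch)

for epoch in range(4):
    print(epoch, halved_lr(0.01, epoch))  # 0.01, 0.005, 0.0025, 0.00125
</code></pre></figure>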
<p>Fortunately, we now have better <code class="highlighter-rouge">momentum based methods</code> to change the learning rate, based on the curvature of the error function. It might also help to set different learning rates for individual parameters in the model, since some parameters might be learning at a relatively slower or faster rate.</p>
<p>Lately, there has been a good amount of research on optimization methods, resulting in <code class="highlighter-rouge">adaptive learning rates</code>. At this moment we have numerous options, starting from the good old <code class="highlighter-rouge">Momentum Method</code> to <code class="highlighter-rouge">Adagrad</code>, <code class="highlighter-rouge">Adam</code> (personal favourite ;)), <code class="highlighter-rouge">RMSProp</code>, etc. Methods like <code class="highlighter-rouge">Adagrad</code> or <code class="highlighter-rouge">Adam</code> effectively save us from manually choosing an <code class="highlighter-rouge">initial learning rate</code>, and given the right amount of time the model will start to converge quite smoothly (of course, selecting a good initial rate will still help).</p>
<h3 id="hyperparameter-tuning-spun-grid-search---embrace-random-search">Hyperparameter Tuning: Spun Grid Search - Embrace Random Search</h3>
<p><strong>Grid Search</strong> has been prevalent in classical machine learning. But Grid Search is not at all efficient at finding optimal hyperparameters for DNNs, primarily because of the time a DNN takes to try out each hyperparameter combination. As the number of hyperparameters keeps increasing, the computation required for Grid Search grows exponentially.</p>
<p>There are two ways to go about it:</p>
<ol>
<li>Based on your prior experience, you can manually tune some common hyperparameters like the learning rate, number of layers, etc.</li>
<li>Instead of Grid Search, use <strong>Random Search/Random Sampling</strong> to choose optimal hyperparameters (see the sketch after this list). A combination of hyperparameters is generally chosen from a <strong>uniform distribution</strong> within the desired range. It is also possible to add prior knowledge to further shrink the search space (e.g., the learning rate shouldn’t be too large or too small). Random Search has been found to be far more efficient than Grid Search.</li>
</ol>
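<p>A minimal sketch of Random Search with numpy - the search ranges are assumptions for illustration, and <code class="highlighter-rouge">train</code>/<code class="highlighter-rouge">validate</code> are hypothetical stand-ins for your own training and evaluation code:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

rng = np.random.RandomState(0)

def sample_hyperparams():
    """Draw one random hyperparameter combination (ranges are illustrative)."""
    return {
        # log-uniform sampling keeps the search sensible across scales
        'learning_rate': 10 ** rng.uniform(-4, -1),
        'num_layers': rng.randint(2, 6),
        'hidden_units': int(2 ** rng.randint(6, 11)),  # 64 .. 1024
        'dropout': rng.uniform(0.2, 0.5),
    }

trials = [sample_hyperparams() for _ in range(20)]
# best = min(trials, key=lambda hp: validate(train(hp)))  # hypothetical helpers
</code></pre></figure>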
<h3 id="learning-methods">Learning Methods</h3>
<p>Good old <strong>Stochastic Gradient Descent</strong> might not be as efficient for DNNs (again, not a stringent rule); lately there has been a lot of research into more flexible optimization algorithms, e.g. <code class="highlighter-rouge">Adagrad</code>, <code class="highlighter-rouge">Adam</code>, <code class="highlighter-rouge">AdaDelta</code>, <code class="highlighter-rouge">RMSProp</code>, etc. In addition to providing <strong>adaptive learning rates</strong>, these sophisticated methods also use <strong>different rates for different model parameters</strong>, which generally results in a smoother convergence. It’s good to consider these as hyperparameters, and one should always try out a few of them on a subset of the training data.</p>
<h3 id="keep-dimensions-of-weights-in-the-exponential-power-of-2">Keep dimensions of weights in the exponential power of 2</h3>
<p>Even when dealing with <strong>state-of-the-art</strong> Deep Learning models on the latest hardware, <strong>memory management</strong> is still done at the byte level; so it’s always good to keep the sizes of your parameters at <code class="highlighter-rouge">64</code>, <code class="highlighter-rouge">128</code>, <code class="highlighter-rouge">512</code>, <code class="highlighter-rouge">1024</code> (all powers of <code class="highlighter-rouge">2</code>). This might help in sharding the matrices, weights, etc., resulting in a slight boost in learning efficiency. This becomes even more significant when dealing with <strong>GPUs</strong>.</p>
<h3 id="unsupervised-pretraining">Unsupervised Pretraining</h3>
<p>It doesn’t matter whether you are working with NLP, Computer Vision, Speech Recognition, etc.: <strong>Unsupervised Pretraining</strong> always helps the training of your supervised or other unsupervised models. <strong>Word Vectors</strong> in NLP are ubiquitous; you can use the <a href="http://image-net.org/">ImageNet</a> dataset to pretrain your model in an unsupervised manner for a 2-class supervised classification, or use audio samples from a much larger domain to inform a speaker disambiguation model.</p>
<h3 id="mini-batch-vs-stochastic-learning">Mini-Batch vs. Stochastic Learning</h3>
<p>The major objective of training a model is to learn appropriate parameters that result in an optimal mapping from inputs to outputs. These parameters are tuned with each training sample, irrespective of your decision to use <strong>batch</strong>, <strong>mini-batch</strong> or <strong>stochastic learning</strong>. When employing a stochastic learning approach, the gradients of the weights are tuned after each training sample, introducing noise into the gradients (hence the word ‘stochastic’). This has a very desirable effect: with the introduction of <strong>noise</strong> during training, the model becomes less prone to overfitting.</p>
<p>However, the stochastic learning approach might be relatively inefficient, since machines nowadays have far more computational power, and stochastic learning effectively wastes a large portion of it. If we are capable of computing <strong>Matrix-Matrix multiplications</strong>, why should we limit ourselves to iterating through multiplications of individual pairs of <strong>vectors</strong>? Therefore, for greater throughput/faster learning, it’s recommended to use mini-batches instead of stochastic learning.</p>
<p>But selecting an appropriate batch size is equally important, so that we still retain some noise (by not using a huge batch) while simultaneously using the computational power of machines more effectively. Commonly, a batch of <code class="highlighter-rouge">16</code> to <code class="highlighter-rouge">128</code> examples is a good choice (a power of <code class="highlighter-rouge">2</code>). Usually, the batch size is selected once you have already found the more important hyperparameters (by <strong>manual search</strong> or <strong>random search</strong>). Nevertheless, there are scenarios where the model receives training data as a stream (<a href="https://en.wikipedia.org/wiki/Online_machine_learning">online learning</a>); resorting to stochastic learning is then a good option.</p>
<h3 id="shuffling-training-examples">Shuffling training examples</h3>
<p>This comes from <strong>Information Theory</strong> - “Learning that an unlikely event has occurred is more informative than learning that a likely event has occurred”. Similarly, randomizing the order of training examples (across different epochs and mini-batches) results in faster convergence. A slight boost is always noticed when the model doesn’t see a lot of examples in the same order.</p>
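<p>A minimal sketch of a shuffled mini-batch generator in numpy (the batch size and seed are arbitrary):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def minibatches(X, y, batch_size=64, seed=0):
    """Yield mini-batches of numpy arrays in a freshly shuffled order."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# per epoch (X_train/y_train are hypothetical arrays):
# for X_b, y_b in minibatches(X_train, y_train, seed=epoch): ...
</code></pre></figure>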
<h3 id="dropout-for-regularization">Dropout for Regularization</h3>
<p>Considering that millions of parameters need to be learned, regularization becomes an imperative requisite to prevent <strong>overfitting</strong> in DNNs. You can keep using <strong>L1/L2</strong> regularization as well, but <strong>Dropout</strong> is preferable for checking overfitting in DNNs. Dropout is trivial to implement and generally results in faster learning. A default value of <code class="highlighter-rouge">0.5</code> is a good choice, although this depends on the specific task; if the model is less complex, a dropout of <code class="highlighter-rouge">0.2</code> might also suffice.</p>
<p>Dropout should be turned off during the test phase, and the weights should be scaled accordingly, as done in the <a href="https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf">original paper</a>. Just allow a model with Dropout regularization a little more training time, and the error will surely go down.</p>
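<p>A minimal numpy sketch of this train/test behaviour, applied to a layer’s activations (the drop probability is the default discussed above):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

rng = np.random.RandomState(1234)

def dropout_forward(h, drop=0.5, train=True):
    """Drop units while training; scale by the keep probability at test time."""
    keep = 1.0 - drop
    if train:
        mask = rng.binomial(1, keep, size=h.shape)
        return h * mask
    # test phase: nothing is dropped, so compensate by scaling
    return h * keep
</code></pre></figure>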
<h3 id="number-of-epochstraining-iterations">Number of Epochs/Training Iterations</h3>
<p>“Training a Deep Learning model for multiple epochs will result in a better model” - we have heard it a couple of times, but how do we quantify “multiple”? Turns out there is a simple strategy for this: just keep training your model for a fixed number of examples/epochs, say <code class="highlighter-rouge">20,000</code> examples or <code class="highlighter-rouge">1</code> epoch. After each such set, compare the <strong>test error</strong> with the <strong>train error</strong>; if the gap is decreasing, keep on training. In addition, after each such set, save a copy of your model parameters, so that you can choose from multiple models once training is finished. A sketch of this loop follows.</p>
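<p>A minimal sketch of that loop, assuming <code class="highlighter-rouge">train_one_epoch</code>, <code class="highlighter-rouge">evaluate</code> and <code class="highlighter-rouge">save</code> are hypothetical callables supplied by your own training code:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def train_with_early_stopping(train_one_epoch, evaluate, save, max_epochs=100):
    """Keep training while the train/test gap shrinks; checkpoint each round."""
    prev_gap = float('inf')
    for epoch in range(max_epochs):
        train_err = train_one_epoch()   # returns current train error
        test_err = evaluate()           # returns current test error
        save(epoch)                     # keep a copy of the parameters
        gap = test_err - train_err
        if gap >= prev_gap:
            break  # the generalization gap has stopped decreasing
        prev_gap = gap
    return epoch
</code></pre></figure>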
<h3 id="visualize">Visualize</h3>
<p>There are a thousand ways in which the training of a deep learning model can go wrong. I guess we have all been there: the model trains for hours or days, and only after training finishes do we realize something went wrong. To save yourself from bouts of hysteria in such situations (which might be quite justified ;)) - <strong>always visualize the training process</strong>. The most obvious step you can take is to <strong>print/save logs</strong> of <code class="highlighter-rouge">loss</code> values, <code class="highlighter-rouge">train error</code>, <code class="highlighter-rouge">test error</code>, etc.</p>
<p>In addition to this, another good practice is to use a visualization library to plot histograms of the weights after every few training examples or between epochs. This might help in keeping track of some of the common problems in Deep Learning models, like <strong>Vanishing Gradients</strong>, <strong>Exploding Gradients</strong>, etc.</p>
<h3 id="multi-core-machines-gpus">Multi-Core machines, GPUs</h3>
<p>The advent of GPUs, libraries that provide vectorized operations, and machines with more computational power are probably some of the most significant factors in the success of Deep Learning. If you think you are as patient as a stone, you might try running a DNN on your laptop (which can’t even open 10 tabs in your Chrome browser) and wait ages for your results. Or you can play it smart (and expensively :z) and get decent hardware with at least <strong>multiple CPU cores</strong> and a <strong>few hundred GPU cores</strong>. GPUs have revolutionized Deep Learning research (no wonder Nvidia’s stock is shooting up ;)), primarily because of their ability to perform matrix operations at a larger scale.</p>
<p>So, instead of taking weeks on a normal machine, these parallelization techniques will bring the training time down to days, if not hours.</p>
<h3 id="use-libraries-with-gpu-and-automatic-differentiation-support">Use libraries with GPU and Automatic Differentiation Support</h3>
<p>Thankfully, for rapid prototyping we have some really decent libraries like <a href="http://deeplearning.net/software/theano/">Theano</a>, <a href="https://www.tensorflow.org/">Tensorflow</a>, <a href="https://keras.io/">Keras</a>, etc. Almost all of these DL libraries provide <strong>support for GPU computation</strong> and <strong>Automatic Differentiation</strong>. So you don’t have to dive into core GPU programming (unless you want to - it’s definitely fun :)), nor do you have to write your own differentiation code, which might get a little taxing in really complex models (although you should be able to do that if required). Tensorflow additionally provides support for training your models on a <strong>distributed architecture</strong> (if you can afford it).</p>
<p>This is not at all an exhaustive list of practices for training a DNN. In order to include only the most common ones, I have excluded a few concepts like normalization of inputs, Batch/Layer Normalization, Gradient Checking, etc. Feel free to add anything in the comments section and I’ll be more than happy to update the post. :)</p>
<h3 id="references">References:</h3>
<ol>
<li><a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">Efficient BackProp(Yann LeCun et al.)</a></li>
<li><a href="https://arxiv.org/pdf/1206.5533v2.pdf">Practical Recommendations for Deep Architectures(Yoshua Bengio)</a></li>
<li><a href="http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf">Understanding the difficulty of training deep feedforward neural networks(Glorot and Bengio, 2010)</a></li>
<li><a href="https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf">Dropout: A Simple Way to Prevent Neural Networks from Overfitting</a></li>
<li><a href="https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b#.yd17cx8ml">Andrej Karpathy - Yes you should understand backprop(Medium)</a></li>
</ol>
</description>
<pubDate>Thu, 05 Jan 2017 14:30:05 +0530</pubDate>
<link>http://rishy.github.io//ml/2017/01/05/how-to-train-your-dnn/</link>
<guid isPermaLink="true">http://rishy.github.io//ml/2017/01/05/how-to-train-your-dnn/</guid>
</item>
<item>
<title>Dropout with Theano</title>
<description><p>Almost everyone working with Deep Learning would have heard at least a smattering about <strong>Dropout</strong>. It is a simple concept (<a href="https://arxiv.org/pdf/1207.0580v1.pdf">introduced</a> a couple of years ago) that sounds like a pretty obvious way of model averaging, resulting in a more generalized and regularized Neural Net; still, when you actually get into the nitty-gritty details of implementing it in your favourite library (Theano being mine), you might find some roadblocks there. Why? Because it’s not exactly straightforward to randomly deactivate some neurons in a DNN.</p>
<p>In this post, we’ll just recapitulate what has already been explained in detail about Dropout in a lot of papers and online resources (some of these are provided at the end of the post). Our main focus will be on implementing a Dropout layer in <a href="https://docs.scipy.org/doc/numpy-dev/user/quickstart.html">Numpy</a> and <a href="http://deeplearning.net/software/theano/introduction.html">Theano</a>, while taking care of all the related caveats. You can find the Jupyter Notebook with the Dropout class <a href="http://nbviewer.ipython.org/github/rishy/rishy.github.io/blob/master/ipy_notebooks/Dropout-Theano.ipynb">here</a>.</p>
<p>Regularization is a technique to prevent <a href="https://en.wikipedia.org/wiki/Overfitting">Overfitting</a> in a machine learning model. Considering that a DNN has a highly complex function to fit, it can easily overfit on a small or intermediate-sized dataset.</p>
<p>In very simple terms - <em>Dropout is a highly efficient regularization technique wherein, for each iteration, we randomly remove some of the neurons in a DNN</em> (along with their connections; have a look at Fig. 1). So how does this help in regularizing a DNN? Well, by randomly removing some of the cells in the computational graph (the Neural Net), we prevent some of the neurons (which are basically hidden features of the Neural Net) from overfitting on all of the training samples. So this is more like considering only a handful of features (neurons) for each training sample and producing the output based on those features only. This results in a completely different neural net (hopefully ;)) for each training sample, and eventually our output is the average of these different nets (any <code class="highlighter-rouge">Random Forests</code>-phile here? :D).</p>
<h2 id="graphical-overview">Graphical Overview:</h2>
<p>In Fig. 1, we have a fully connected deep neural net on the left side, where each neuron is connected to the neurons in the layers above and below it. On the right side, we have randomly omitted some neurons along with their connections. For every learning step, the neural net on the right will have a different representation; consequently, only the connected neurons and their weights will be learned in a particular learning step.</p>
<p style="display: flex;">
<img src="../../../../../images/nn.png" style="height: 45%; width: 45%" />
<img src="../../../../../images/dropout-nn.png" style="height: 45%; width: 45%" />
</p>
<p style="text-align: center">
Fig. 1<br />
<span style="color: #000; font-size: 1rem;">
Left: DNN without Dropout, Right: DNN with some dropped neurons
</span>
</p>
<h2 id="theano-implementation">Theano Implementation:</h2>
<p>Let’s dive straight into the code for implementing a Dropout layer. If you don’t have prior knowledge of Theano and Numpy, please go through these two awesome blogs by <a href="https://twitter.com/dennybritz">@dennybritz</a> - <a href="http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/">Implementing a Neural Network from Scratch</a> and <a href="http://www.wildml.com/2015/09/speeding-up-your-neural-network-with-theano-and-the-gpu/">Speeding up your neural network with Theano and the GPU</a>.</p>
<p>As a general recommendation, whenever we are dealing with random numbers it is advisable to set a random seed.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10</pre></td><td class="code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">theano.sandbox.rng_mrg</span> <span class="kn">import</span> <span class="n">MRG_RandomStreams</span> <span class="k">as</span> <span class="n">RandomStreams</span>
<span class="kn">import</span> <span class="nn">theano</span>
<span class="c"># Set seed for the random numbers</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1234</span><span class="p">)</span>
<span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">RandomState</span><span class="p">(</span><span class="mi">1234</span><span class="p">)</span>
<span class="c"># Generate a theano RandomStreams</span>
<span class="n">srng</span> <span class="o">=</span> <span class="n">RandomStreams</span><span class="p">(</span><span class="n">rng</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">999999</span><span class="p">))</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Let’s walk through the above code. First, we import all the necessary modules (more about <code class="highlighter-rouge">RandomStreams</code> in a moment) and set the random seed, so that the random numbers generated are consistent across different runs. We then create an object <code class="highlighter-rouge">rng</code> of <code class="highlighter-rouge">numpy.random.RandomState</code>; this exposes a number of methods for generating random numbers drawn from a variety of probability distributions.</p>
<p>Theano is designed in a functional manner; as a result, generating random numbers in Theano computation graphs is a bit trickier than in Numpy. Using random variables with Theano is equivalent to imputing random variables into the computation graph. Theano will allocate a numpy <code class="highlighter-rouge">RandomState</code> object for each such variable, and draw from it as necessary. Theano calls this sort of sequence of random numbers a <code class="highlighter-rouge">Random Stream</code>. The <code class="highlighter-rouge">MRG_RandomStreams</code> we are using is another implementation of <code class="highlighter-rouge">RandomStreams</code> in Theano, which works on GPUs as well.</p>
<p>Finally, we create an <code class="highlighter-rouge">srng</code> object, which will provide us with random streams in each run of our optimization function.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12</pre></td><td class="code"><pre><span class="k">def</span> <span class="nf">dropit</span><span class="p">(</span><span class="n">srng</span><span class="p">,</span> <span class="n">weight</span><span class="p">,</span> <span class="n">drop</span><span class="p">):</span>
<span class="c"># proportion of probability to retain</span>
<span class="n">retain_prob</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">drop</span>
<span class="c"># a masking variable</span>
<span class="n">mask</span> <span class="o">=</span> <span class="n">srng</span><span class="o">.</span><span class="n">binomial</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">retain_prob</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">weight</span><span class="o">.</span><span class="n">shape</span><span class="p">,</span>
<span class="n">dtype</span><span class="o">=</span><span class="s">'floatX'</span><span class="p">)</span>
<span class="c"># final weight with dropped neurons</span>
<span class="k">return</span> <span class="n">theano</span><span class="o">.</span><span class="n">tensor</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="n">weight</span> <span class="o">*</span> <span class="n">mask</span><span class="p">,</span>
<span class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">floatX</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Here is our main dropout function, with three arguments: <code class="highlighter-rouge">srng</code> - a RandomStreams generator; <code class="highlighter-rouge">weight</code> - any theano tensor (the weights of a Neural Net); and <code class="highlighter-rouge">drop</code> - a float value denoting the proportion of neurons to drop. Naturally, the proportion of neurons to retain will be <code class="highlighter-rouge">1 - drop</code>.</p>
<p>Inside the function, we generate a random stream from a <a href="https://en.wikipedia.org/wiki/Binomial_distribution">Binomial Distribution</a>, where <code class="highlighter-rouge">n</code> denotes the number of trials, <code class="highlighter-rouge">p</code> is the probability with which to retain each neuron and <code class="highlighter-rouge">size</code> is the shape of the output. As the final step, all we need to do is switch the values of some of the neurons to <code class="highlighter-rouge">0</code>, which can be accomplished by simply multiplying <code class="highlighter-rouge">mask</code> with the <code class="highlighter-rouge">weight</code> tensor/matrix. <code class="highlighter-rouge">theano.tensor.cast</code> further type-casts the resulting value to <code class="highlighter-rouge">theano.config.floatX</code>, which is either the default value of <code class="highlighter-rouge">floatX</code> (<code class="highlighter-rouge">float32</code> in theano) or any other value we might have set in the <code class="highlighter-rouge">.theanorc</code> configuration file.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2</pre></td><td class="code"><pre><span class="k">def</span> <span class="nf">dont_dropit</span><span class="p">(</span><span class="n">weight</span><span class="p">,</span> <span class="n">drop</span><span class="p">):</span>
<span class="k">return</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">drop</span><span class="p">)</span><span class="o">*</span><span class="n">theano</span><span class="o">.</span><span class="n">tensor</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="n">weight</span><span class="p">,</span> <span class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">floatX</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Now, one thing to keep in mind: we only want to drop neurons during the training phase, not during the validation or test phase. Also, we need to somehow compensate for the fact that we deactivated some of the neurons during training. There are two ways to achieve this:</p>
<ol>
<li>
<p><strong>Scaling the Weights</strong> (implemented at the test phase): Since our resulting Neural Net is an averaged model, it makes sense to use the averaged value of the weights during the test phase, considering that we are not deactivating any neurons there. The easiest way to do this is to scale the weights (which acts as averaging) by the retain probability at test time. This is exactly what we are doing in the above function.</p>
</li>
<li>
<p><strong>Inverted Dropout</strong> (implemented at the training phase): Scaling the weights has its caveats, since we have to tweak the weights at test time. ‘Inverted Dropout’, on the other hand, performs the scaling at training time, so we don’t have to touch the test code whenever we decide to change the Dropout layer. In this post we’ll be using the first method (scaling), although I’d recommend you play with Inverted Dropout as well - a minimal sketch follows this list, and you can follow <a href="https://github.com/cs231n/cs231n.github.io/blob/master/neural-networks-2.md#reg">this</a> up for guidance.</p>
</li>
</ol>
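<p>For reference, here is a minimal sketch of the inverted variant of the <code class="highlighter-rouge">dropit</code> function above (reusing the <code class="highlighter-rouge">srng</code> and imports from the earlier snippets). The only change is dividing by the retain probability at training time, so no test-time scaling is needed:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def dropit_inverted(srng, weight, drop):
    # proportion of probability to retain
    retain_prob = 1 - drop
    mask = srng.binomial(n=1, p=retain_prob, size=weight.shape,
                         dtype='floatX')
    # scale at *training* time, so test-phase weights can be used unchanged
    return theano.tensor.cast(weight * mask / retain_prob,
                              theano.config.floatX)
</code></pre></figure>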
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4</pre></td><td class="code"><pre><span class="k">def</span> <span class="nf">dropout_layer</span><span class="p">(</span><span class="n">weight</span><span class="p">,</span> <span class="n">drop</span><span class="p">,</span> <span class="n">train</span> <span class="o">=</span> <span class="mi">1</span><span class="p">):</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">ifelse</span><span class="o">.</span><span class="n">ifelse</span><span class="p">(</span><span class="n">theano</span><span class="o">.</span><span class="n">tensor</span><span class="o">.</span><span class="n">eq</span><span class="p">(</span><span class="n">train</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">dropit</span><span class="p">(</span><span class="n">weight</span><span class="p">,</span> <span class="n">drop</span><span class="p">),</span> <span class="n">dont_dropit</span><span class="p">(</span><span class="n">weight</span><span class="p">,</span> <span class="n">drop</span><span class="p">))</span>
<span class="k">return</span> <span class="n">result</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Our final <code class="highlighter-rouge">dropout_layer</code> function uses the <code class="highlighter-rouge">theano.ifelse</code> module to return the value of either the <code class="highlighter-rouge">dropit</code> or the <code class="highlighter-rouge">dont_dropit</code> function, conditioned on whether our <code class="highlighter-rouge">train</code> flag is on or off. So while the model is in the training phase, we use dropout on our model weights; in the test phase, we simply scale the weights to compensate for all the training steps where we omitted some random neurons.</p>
<p>Finally, here’s how you can add a Dropout layer to your DNN. I am taking the example of an RNN, similar to the one used in <a href="http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/">this</a> blog:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25</pre></td><td class="code"><pre><span class="n">x</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">ivector</span><span class="p">(</span><span class="s">'x'</span><span class="p">)</span>
<span class="n">drop_value</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">scalar</span><span class="p">(</span><span class="s">'drop_value'</span><span class="p">)</span>
<span class="n">dropout</span> <span class="o">=</span> <span class="n">Dropout</span><span class="p">()</span>
<span class="n">gru</span> <span class="o">=</span> <span class="n">GRU</span><span class="p">(</span><span class="o">...</span><span class="p">)</span> <span class="c">#An object of GRU class with required arguments</span>
<span class="n">params</span> <span class="o">=</span> <span class="n">OrderedDict</span><span class="p">(</span><span class="o">...</span><span class="p">)</span> <span class="c">#A dictionary of model parameters</span>
<span class="k">def</span> <span class="nf">forward_prop</span><span class="p">(</span><span class="n">x_t</span><span class="p">,</span> <span class="n">s_t_prev</span><span class="p">,</span> <span class="n">drop_value</span><span class="p">,</span> <span class="n">train</span><span class="p">,</span> <span class="n">E</span><span class="p">,</span> <span class="n">U</span><span class="p">,</span> <span class="n">W</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
<span class="c"># Word vector embeddings</span>
<span class="n">x_e</span> <span class="o">=</span> <span class="n">E</span><span class="p">[:,</span> <span class="n">x_t</span><span class="p">]</span>
<span class="c"># GRU Layer</span>
<span class="n">W</span> <span class="o">=</span> <span class="n">dropout</span><span class="o">.</span><span class="n">dropout_layer</span><span class="p">(</span><span class="n">W</span><span class="p">,</span> <span class="n">drop_value</span><span class="p">,</span> <span class="n">train</span><span class="p">)</span>
<span class="n">U</span> <span class="o">=</span> <span class="n">dropout</span><span class="o">.</span><span class="n">dropout_layer</span><span class="p">(</span><span class="n">U</span><span class="p">,</span> <span class="n">drop_value</span><span class="p">,</span> <span class="n">train</span><span class="p">)</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">gru</span><span class="o">.</span><span class="n">GRU_layer</span><span class="p">(</span><span class="n">x_e</span><span class="p">,</span> <span class="n">s_t_prev</span><span class="p">,</span> <span class="n">U</span><span class="p">,</span> <span class="n">W</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="k">return</span> <span class="n">s_t</span>
<span class="n">s</span><span class="p">,</span> <span class="n">updates</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">scan</span><span class="p">(</span><span class="n">forward_prop</span><span class="p">,</span>
<span class="n">sequences</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="p">],</span>
<span class="n">non_sequences</span> <span class="o">=</span> <span class="p">[</span><span class="n">drop_value</span><span class="p">,</span> <span class="n">train</span><span class="p">,</span> <span class="n">params</span><span class="p">[</span><span class="s">'E'</span><span class="p">],</span>
<span class="n">params</span><span class="p">[</span><span class="s">'U'</span><span class="p">],</span> <span class="n">params</span><span class="p">[</span><span class="s">'W'</span><span class="p">],</span> <span class="n">params</span><span class="p">[</span><span class="s">'b'</span><span class="p">]],</span>
<span class="n">outputs_info</span> <span class="o">=</span> <span class="p">[</span><span class="nb">dict</span><span class="p">(</span><span class="n">initial</span><span class="o">=</span><span class="n">T</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">hidden_dim</span><span class="p">))])</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Here we have the <code class="highlighter-rouge">forward_prop</code> function for an RNN+GRU model. Starting from the first line, we create a theano tensor variable <code class="highlighter-rouge">x</code> for the input (words), and another <code class="highlighter-rouge">drop_value</code> variable of type <code class="highlighter-rouge">theano.tensor.scalar</code>, which will take a float value denoting the proportion of neurons to be dropped.</p>
<p>Then we create an object <code class="highlighter-rouge">dropout</code> of the <code class="highlighter-rouge">Dropout</code> class we implemented in the previous sections. After this, we instantiate a <code class="highlighter-rouge">GRU</code> object (I have kept this as a generic class, since you might have a different implementation). We also have one more variable, <code class="highlighter-rouge">params</code>, which is an <code class="highlighter-rouge">OrderedDict</code> containing the model parameters.</p>
<p>Furthermore, <code class="highlighter-rouge">E</code> is our word embedding matrix, <code class="highlighter-rouge">U</code> contains the input-to-hidden-layer weights, <code class="highlighter-rouge">W</code> the hidden-to-hidden-layer weights and <code class="highlighter-rouge">b</code> the bias. Then we have our workhorse - the <code class="highlighter-rouge">forward_prop</code> function, which is called iteratively for each value in the <code class="highlighter-rouge">x</code> variable (here these values are the indexes of sequential words in the text). Now all we have to do is call the <code class="highlighter-rouge">dropout_layer</code> function from <code class="highlighter-rouge">forward_prop</code>, which will return <code class="highlighter-rouge">W</code> and <code class="highlighter-rouge">U</code> with a few dropped neurons.</p>
<p>That’s it in terms of implementing and using a dropout layer with Theano, although there are a few things mentioned in the next section which you have to keep in mind when working with <code class="highlighter-rouge">RandomStreams</code>.</p>
<h2 id="few-things-to-take-care-of">Few things to take care of:</h2>
<p><b>Wherever we use a <code class="highlighter-rouge">theano.function</code> after this, we have to explicitly pass it the <code class="highlighter-rouge">updates</code> we got from the <code class="highlighter-rouge">theano.scan</code> call in the previous section. Why?</b>
Whenever there is a call to theano’s <code class="highlighter-rouge">RandomStreams</code>, it produces some updates, and all of the theano functions following the above code should be made aware of these updates. So let’s have a look at this code:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14</pre></td><td class="code"><pre><span class="n">o</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">nnet</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">params</span><span class="p">[</span><span class="s">'V'</span><span class="p">]</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])))</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">o</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="c"># cost/loss function</span>
<span class="n">loss</span> <span class="o">=</span> <span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">nnet</span><span class="o">.</span><span class="n">categorical_crossentropy</span><span class="p">(</span><span class="n">o</span><span class="p">,</span> <span class="n">y</span><span class="p">))</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="c"># cast values in 'updates' variable to a list</span>
<span class="n">updates</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">updates</span><span class="o">.</span><span class="n">items</span><span class="p">())</span>
<span class="c"># couple of commonly used theano functions with 'updates '</span>
<span class="n">predict</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">([</span><span class="n">x</span><span class="p">],</span> <span class="n">o</span><span class="p">,</span> <span class="n">updates</span> <span class="o">=</span> <span class="n">updates</span><span class="p">)</span>
<span class="n">predict_class</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">([</span><span class="n">x</span><span class="p">],</span> <span class="n">prediction</span><span class="p">,</span> <span class="n">updates</span> <span class="o">=</span> <span class="n">updates</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">([</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">],</span> <span class="n">loss</span><span class="p">,</span> <span class="n">updates</span> <span class="o">=</span> <span class="n">updates</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>As a standard procedure, we use another model parameter <code class="highlighter-rouge">V</code> (hidden-to-output) and take a <code class="highlighter-rouge">softmax</code> over it. If you have a look at the <code class="highlighter-rouge">predict</code> and <code class="highlighter-rouge">loss</code> functions, we had to explicitly tell them about the <code class="highlighter-rouge">updates</code> that <code class="highlighter-rouge">RandomStreams</code> made during the execution of the <code class="highlighter-rouge">dropout_layer</code> function; otherwise Theano will throw an error.</p>
<p><b>What is an appropriate float value for dropout?</b>
To be on the safe side, a value of <code class="highlighter-rouge">0.5</code> (as mentioned in the original <a href="https://arxiv.org/pdf/1207.0580v1.pdf">paper</a>) is generally good enough, although you can always tweak it a bit and see what works best for your model.</p>
<h2 id="alternatives-to-dropout">Alternatives to Dropout</h2>
<p>Lately, there has been a lot of research into better regularization methods for DNNs. One of the things that I really like about Dropout is that it is conceptually very simple as well as a highly effective way to prevent overfitting. Here are a few more methods that are increasingly being used in DNNs nowadays (I am omitting the standard L1/L2 regularization here):</p>
<ol>
<li>
<p><strong>Batch Normalization:</strong>
Batch Normalization primarily tackles the problem of <em>internal covariate shift</em> by normalizing the layer inputs within each mini-batch. So, in addition to simply starting from normalized inputs, Batch Normalization keeps normalizing them throughout the whole training phase. This accelerates the optimization process and, as a side product, might also eliminate the need for Dropout. Have a look at the original <a href="https://arxiv.org/pdf/1502.03167.pdf">paper</a> for a more in-depth explanation.</p>
</li>
<li>
<p><strong>Max-Norm:</strong>
Max-Norm puts a specific upper bound on the magnitude of the weight matrices, and if the magnitude exceeds this threshold then the values of the weight matrices are clipped down. This is particularly helpful for the exploding gradient problem.</p>
</li>
<li>
<p><strong>DropConnect:</strong>
When training with Dropout, a randomly selected subset of activations are set to zero within each layer. DropConnect instead sets a randomly selected subset of weights within the network to zero. Each unit thus receives input from a random subset of units in the previous layer. We derive a bound on the generalization performance of both Dropout and DropConnect. - Abstract from the original <a href="https://cs.nyu.edu/~wanli/dropc/dropc.pdf">paper</a>.</p>
</li>
<li>
<p><strong>ZoneOut (specific to RNNs):</strong>
In each training step, ZoneOut keeps the values of some of the hidden units unchanged. So, instead of throwing away the information, it enforces that a random subset of hidden units propagate the same information to the next time step.</p>
</li>
</ol>
<p>The reason I wanted to write about this is that if you are working with a low-level library like Theano, then sometimes using modules like <code class="highlighter-rouge">RandomStreams</code> can get a bit tricky. For prototyping, and even for production purposes, you should also consider higher-level libraries like <a href="https://keras.io/">Keras</a> and <a href="https://www.tensorflow.org/">TensorFlow</a>.</p>
<p>Feel free to add any other regularization methods and feedback in the comments section.</p>
<p>Suggested Readings:</p>
<ol>
<li><a href="http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/">Implementing a Neural Network From Scratch - Wildml</a></li>
<li><a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/">Introduction to Recurrent Neural Networks - Wildml</a></li>
<li><a href="https://arxiv.org/pdf/1207.0580v1.pdf">Improving neural networks by preventing co-adaptation of feature detectors</a></li>
<li><a href="http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf">Dropout: A Simple Way to Prevent Neural Networks from Overfitting</a></li>
<li><a href="http://wiki.ubc.ca/Course:CPSC522/Regularization_for_Neural_Networks">Regularization for Neural Networks</a></li>
<li><a href="http://wikicoursenote.com/wiki/Dropout">Dropout - WikiCourse</a></li>
<li><a href="https://papers.nips.cc/paper/4124-practical-large-scale-optimization-for-max-norm-regularization.pdf">Practical large scale optimization for Max Norm Regularization</a></li>
<li><a href="https://cs.nyu.edu/~wanli/dropc/dropc.pdf">DropConnect Paper</a></li>
<li><a href="https://arxiv.org/abs/1606.01305">ZoneOut Paper</a></li>
<li><a href="https://github.com/cs231n/cs231n.github.io/blob/master/neural-networks-2.md#reg">Regularization in Neural Networks</a></li>
<li><a href="https://arxiv.org/pdf/1502.03167.pdf">Batch Normalization Paper</a></li>
</ol>
</description>
<pubDate>Wed, 12 Oct 2016 17:30:05 +0530</pubDate>
<link>http://rishy.github.io//ml/2016/10/12/dropout-with-theano/</link>
<guid isPermaLink="true">http://rishy.github.io//ml/2016/10/12/dropout-with-theano/</guid>
</item>
<item>
<title>L1 vs. L2 Loss function</title>
<description><p>Least absolute deviations (L1) and least squared errors (L2) are the two standard loss functions that decide what should be minimized while learning from a dataset.</p>
<p>The L1 loss function minimizes the <b>absolute differences</b> between the estimated values and the target values. Summing over each target value <span>\( y_i \)</span> and the corresponding estimate <span>\( h(x_i) \)</span>, where <span>\( x_i \)</span> denotes the feature set of a single sample, the sum of absolute differences for ‘n’ samples can be calculated as,</p>
<div>
$$
\begin{align*}
&amp; S = \sum_{i=1}^{n}|y_i - h(x_i)|
\end{align*}
$$
</div>
<p>On the other hand, the L2 loss function minimizes the <b>squared differences</b> between the estimated and target values.</p>
<div>
$$
\begin{align*}
&amp; S = \sum_{i=1}^{n}(y_i - h(x_i))^2
\end{align*}
$$
</div>
<p>As is apparent from the above formulae, the L2 error will be much larger in the presence of outliers than the L1 error, since the difference between an incorrectly predicted value and the original target value is already large, and squaring makes it larger still.</p>
<p>As a result, the L1 loss function is more robust and is generally not affected by outliers. The L2 loss function, on the contrary, will try to adjust the model according to these outlier values, even at the expense of other samples, and is therefore highly sensitive to outliers in the dataset.</p>
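<p>Before moving to a real dataset, a toy numpy example (with made-up numbers) makes this sensitivity concrete:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.2, 6.9, 9.1])

l1 = np.sum(np.abs(y_true - y_pred))    # 0.6
l2 = np.sum((y_true - y_pred) ** 2)     # 0.1

# Turn the last target into an outlier
y_true[-1] = 90.0
l1_out = np.sum(np.abs(y_true - y_pred))  # 81.4   -&gt; grows linearly
l2_out = np.sum((y_true - y_pred) ** 2)   # ~6544.9 -&gt; explodes
</code></pre></figure>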
<p>We’ll see how outliers can affect the performance of a regression model. We are going to use pandas, scikit-learn and numpy to work through this. I’d highly recommend having a look at the <a href="http://nbviewer.ipython.org/github/rishy/rishy.github.io/blob/master/ipy_notebooks/L1%20vs.%20L2%20Loss.ipynb">ipython notebook</a> containing the code for this post.</p>
<p>We’ll be using the Boston Housing Prices dataset and will try to predict the prices using the Gradient Boosting Regressor from scikit-learn. You can download the dataset directly from <a href="https://archive.ics.uci.edu/ml/datasets/Housing">UCI Datasets</a> or from this <a href="../../../../../ipy_notebooks/Datasets/Housing.csv">csv</a>.</p>
<p>We are going to start by reading the data from the csv file.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25</pre></td><td class="code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">sklearn.cross_validation</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">GradientBoostingRegressor</span>
<span class="kn">from</span> <span class="nn">statsmodels.tools.eval_measures</span> <span class="kn">import</span> <span class="n">rmse</span>
<span class="kn">import</span> <span class="nn">matplotlib.pylab</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="c"># Make pylab inline and set the theme to 'ggplot'</span>
<span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">(</span><span class="s">'ggplot'</span><span class="p">)</span>
<span class="o">%</span><span class="n">pylab</span> <span class="n">inline</span>
<span class="c"># Read Boston Housing Data</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'Datasets/Housing.csv'</span><span class="p">)</span>
<span class="c"># Create a data frame with all the independent features</span>
<span class="n">data_indep</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'medv'</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="c"># Create a target vector(vector of dependent variable, i.e. 'medv')</span>
<span class="n">data_dep</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s">'medv'</span><span class="p">]</span>
<span class="c"># Split data into training and test sets</span>
<span class="n">train_X</span><span class="p">,</span> <span class="n">test_X</span><span class="p">,</span> <span class="n">train_y</span><span class="p">,</span> <span class="n">test_y</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span>
<span class="n">data_indep</span><span class="p">,</span> <span class="n">data_dep</span><span class="p">,</span>
<span class="n">test_size</span> <span class="o">=</span> <span class="mf">0.20</span><span class="p">,</span>
<span class="n">random_state</span> <span class="o">=</span> <span class="mi">42</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<h4 id="regression-without-any-outliers">Regression without any Outliers:</h4>
<p>At this point, our housing dataset is pretty much clean and doesn’t contain any outliers as such, so let’s fit a GB regressor with the L1 and L2 loss functions.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11</pre></td><td class="code"><pre><span class="c"># GradientBoostingRegressor with a L1(Least Absolute Deviations) loss function</span>
<span class="c"># Set a random seed so that we can reproduce the results</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">32767</span><span class="p">)</span>
<span class="n">mod</span> <span class="o">=</span> <span class="n">GradientBoostingRegressor</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'lad'</span><span class="p">)</span>
<span class="n">fit</span> <span class="o">=</span> <span class="n">mod</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_X</span><span class="p">,</span> <span class="n">train_y</span><span class="p">)</span>
<span class="n">predict</span> <span class="o">=</span> <span class="n">fit</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">test_X</span><span class="p">)</span>
<span class="c"># Root Mean Squared Error</span>
<span class="k">print</span> <span class="s">"RMSE -&gt; </span><span class="si">%</span><span class="s">f"</span> <span class="o">%</span> <span class="n">rmse</span><span class="p">(</span><span class="n">predict</span><span class="p">,</span> <span class="n">test_y</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>With an L1 loss function and no outliers, we get an RMSE of 3.440147.
Let’s see what results we get with the L2 loss function.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8</pre></td><td class="code"><pre><span class="c"># GradientBoostingRegressor with L2(Least Square errors) loss function</span>
<span class="n">mod</span> <span class="o">=</span> <span class="n">GradientBoostingRegressor</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'ls'</span><span class="p">)</span>
<span class="n">fit</span> <span class="o">=</span> <span class="n">mod</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_X</span><span class="p">,</span> <span class="n">train_y</span><span class="p">)</span>
<span class="n">predict</span> <span class="o">=</span> <span class="n">fit</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">test_X</span><span class="p">)</span>
<span class="c"># Root Mean Squared Error</span>
<span class="k">print</span> <span class="s">"RMSE -&gt; </span><span class="si">%</span><span class="s">f"</span> <span class="o">%</span> <span class="n">rmse</span><span class="p">(</span><span class="n">predict</span><span class="p">,</span> <span class="n">test_y</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>This prints out an RMSE of 2.542019.</p>
<p>As is apparent from the RMSE values of the L1 and L2 loss functions, least squares (L2)
outperforms L1 when there are no outliers in the data.</p>
<h4 id="regression-with-outliers">Regression with Outliers:</h4>
<p>After looking at the minimum and maximum values of the ‘medv’ column, we can see
that its range is [5, 50].<br />
Let’s add a few outliers to this dataset, so that we can see some significant
differences between the <b>L1</b> and <b>L2</b> loss functions.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4</pre></td><td class="code"><pre><span class="c"># Get upper and lower bounds[min, max] of all the features</span>
<span class="n">stats</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
<span class="n">extremes</span> <span class="o">=</span> <span class="n">stats</span><span class="o">.</span><span class="n">loc</span><span class="p">[[</span><span class="s">'min'</span><span class="p">,</span> <span class="s">'max'</span><span class="p">],:]</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'medv'</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">extremes</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Now, we are going to generate 5 random samples, such that their values lie in
the [min, max] range of the respective features.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17</pre></td><td class="code"><pre><span class="c"># Set a random seed</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1234</span><span class="p">)</span>
<span class="c"># Create 5 random values </span>
<span class="n">rands</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">rands</span>
<span class="c"># Get the 'min' and 'max' rows as numpy array</span>
<span class="n">min_array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">extremes</span><span class="o">.</span><span class="n">loc</span><span class="p">[[</span><span class="s">'min'</span><span class="p">],</span> <span class="p">:])</span>
<span class="n">max_array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">extremes</span><span class="o">.</span><span class="n">loc</span><span class="p">[[</span><span class="s">'max'</span><span class="p">],</span> <span class="p">:])</span>
<span class="c"># Find the difference(range) of 'max' and 'min'</span>
<span class="nb">range</span> <span class="o">=</span> <span class="n">max_array</span> <span class="o">-</span> <span class="n">min_array</span>
<span class="c"># Generate 5 samples with 'rands' value</span>
<span class="n">outliers_X</span> <span class="o">=</span> <span class="p">(</span><span class="n">rands</span> <span class="o">*</span> <span class="nb">range</span><span class="p">)</span> <span class="o">+</span> <span class="n">min_array</span>
<span class="n">outliers_X</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>array([[ 17.04578252, 19.15194504, 5.68465061, 0.19151945,
0.47807845, 4.56054001, 21.49653863, 3.23572024,
5.40494736, 287.356192 , 14.40028283, 76.27278363,
8.67066488],…,
[ 69.40067405, 77.99758081, 21.73774005, 0.77997581,
0.76406824, 7.63169374, 78.63565097, 9.70691596,
18.93944359, 595.70732345, 19.9317726 , 309.64280598,
29.99632329]])</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3</pre></td><td class="code"><pre><span class="c"># We will also create some hard coded outliers</span>
<span class="c"># for 'medv', i.e. our target</span>
<span class="n">medv_outliers</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">600</span><span class="p">,</span> <span class="mi">700</span><span class="p">,</span> <span class="mi">600</span><span class="p">])</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14</pre></td><td class="code"><pre><span class="c"># Change the type of 'chas', 'rad' and 'tax' to rounded of Integers</span>
<span class="n">outliers_X</span><span class="p">[:,</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="p">]]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">int64</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">round</span><span class="p">(</span><span class="n">outliers_X</span><span class="p">[:,</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="p">]]))</span>
<span class="c"># Finally concatenate our existing 'train_X' and</span>
<span class="c"># 'train_y' with these outliers</span>
<span class="n">train_X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">train_X</span><span class="p">,</span> <span class="n">outliers_X</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">train_y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">train_y</span><span class="p">,</span> <span class="n">medv_outliers</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="c"># Plot a histogram of 'medv' in train_y</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">13</span><span class="p">,</span><span class="mi">7</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">train_y</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="nb">range</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="mi">10</span><span class="p">,</span> <span class="mi">800</span><span class="p">))</span>
<span class="n">fig</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">'medv Count'</span><span class="p">,</span> <span class="n">fontsize</span> <span class="o">=</span> <span class="mi">20</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'medv'</span><span class="p">,</span> <span class="n">fontsize</span> <span class="o">=</span> <span class="mi">16</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'count'</span><span class="p">,</span> <span class="n">fontsize</span> <span class="o">=</span> <span class="mi">16</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p><img src="../../../../../images/l1-l2-loss.png" alt="png" /></p>
<p>You can see there are some clear outliers at 600 and 700, and even one or two ‘medv’
values of 0.<br />
Now that our outliers are in place, we will once again fit the
GradientBoostingRegressor with the L1 and L2 loss functions to contrast their
performance in the presence of outliers.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10</pre></td><td class="code"><pre><span class="c"># GradientBoostingRegressor with L1 loss function</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">9876</span><span class="p">)</span>
<span class="n">mod</span> <span class="o">=</span> <span class="n">GradientBoostingRegressor</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'lad'</span><span class="p">)</span>
<span class="n">fit</span> <span class="o">=</span> <span class="n">mod</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_X</span><span class="p">,</span> <span class="n">train_y</span><span class="p">)</span>
<span class="n">predict</span> <span class="o">=</span> <span class="n">fit</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">test_X</span><span class="p">)</span>
<span class="c"># Root Mean Squared Error</span>
<span class="k">print</span> <span class="s">"RMSE -&gt; </span><span class="si">%</span><span class="s">f"</span> <span class="o">%</span> <span class="n">rmse</span><span class="p">(</span><span class="n">predict</span><span class="p">,</span> <span class="n">test_y</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>We get an RMSE of 7.055568 with the L1 loss function when outliers are present.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8</pre></td><td class="code"><pre><span class="c"># GradientBoostingRegressor with L2 loss function</span>
<span class="n">mod</span> <span class="o">=</span> <span class="n">GradientBoostingRegressor</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'ls'</span><span class="p">)</span>
<span class="n">fit</span> <span class="o">=</span> <span class="n">mod</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_X</span><span class="p">,</span> <span class="n">train_y</span><span class="p">)</span>
<span class="n">predict</span> <span class="o">=</span> <span class="n">fit</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">test_X</span><span class="p">)</span>
<span class="c"># Root Mean Squared Error</span>
<span class="k">print</span> <span class="s">"RMSE -&gt; </span><span class="si">%</span><span class="s">f"</span> <span class="o">%</span> <span class="n">rmse</span><span class="p">(</span><span class="n">predict</span><span class="p">,</span> <span class="n">test_y</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>On the other hand, we get an RMSE of 9.806251 with the L2 loss function when outliers are present.</p>
<p>With outliers in the dataset, the L2 loss function tries to adjust the
model according to these outliers at the expense of other, good samples,
since the squared error is huge for these outliers (whenever the error is &gt; 1).
L1 (least absolute deviations), on the other hand, is quite resistant to
outliers.<br />
As a result, the L2 loss function may produce large deviations on some of the
samples, which reduces overall accuracy.</p>
<p>So, if your dataset contains outliers that you cannot remove, or that genuinely belong there, an L1 loss function is the better choice. If, on the other hand, the outliers are undesired noise and you want a stable solution, you should first try to remove them and then use an L2 loss function; otherwise the performance of a model trained with L2 may deteriorate badly because of those outliers.</p>
<p>Whenever in doubt, prefer the L2 loss function; it works well in most situations.</p>
</description>
<pubDate>Tue, 28 Jul 2015 17:30:05 +0530</pubDate>
<link>http://rishy.github.io//ml/2015/07/28/l1-vs-l2-loss/</link>
<guid isPermaLink="true">http://rishy.github.io//ml/2015/07/28/l1-vs-l2-loss/</guid>
</item>
<item>
<title>Normal/Gaussian Distributions</title>
<description><p>Normal distributions are the most common distributions in statistics, primarily because they describe a lot of natural phenomena. They are also known as ‘Gaussian distributions’, or the ‘bell curve’, because of their bell-shaped density.</p>
<p><img src="../../../../../images/normal_distributions.png" alt="bell" /></p>
<p>Heights of people, sizes of things produced by machines, errors in measurements, blood pressure, marks in an examination, wages paid by a company, the life span of a species: all of these follow a normal or nearly normal distribution.</p>
<p>I don’t intend to cover a lot of mathematical background on normal distributions, but it won’t hurt to know a few of their simple properties:</p>
<ul>
<li>The bell curve is symmetrical about the mean, which lies at the center</li>
<li>mean = median = mode</li>
<li>A normal distribution is completely determined by its mean and standard deviation</li>
</ul>
<p>We can also get a normal distribution from a lot of datasets using the <a href="http://en.wikipedia.org/wiki/Central_limit_theorem">Central Limit Theorem</a> (CLT). In layman’s terms, the CLT states that if we repeatedly draw sufficiently large samples from a population and plot the <em>means</em> of these samples, the resulting distribution will be approximately normal (which a lot of statistical and machine learning models can then exploit).</p>
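<p>A quick numpy sketch (with an arbitrary population and sample size) illustrates this: even when the population is heavily skewed, the sample means come out approximately normal:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

np.random.seed(42)
# Population is exponential: heavily skewed, far from normal
samples = np.random.exponential(scale=2.0, size=(10000, 50))

# Means of the 10,000 samples of size 50; by the CLT their
# distribution is approximately normal around the population mean
sample_means = samples.mean(axis=1)
print(sample_means.mean())  # close to 2.0
print(sample_means.std())   # close to 2.0 / sqrt(50)
</code></pre></figure>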
<p>A lot of machine learning models assume that the data fed to them follows a normal distribution. So, after you have cleaned your data, you should definitely check what distribution it follows. Some of the machine learning and statistical models that assume normally distributed input data are:</p>
<ul>
<li>Gaussian naive Bayes</li>
<li>Least-squares based (regression) models</li>
<li>LDA</li>
<li>QDA</li>
</ul>
<p>It is also quite common to transform non-normal data to normal form by applying a log, square root or similar transformation.</p>
<p>If plotting the data results in a skewed plot, then it may be a log-normal distribution (as shown in the figure below), which you can transform into normal form simply by applying a log function to all the data points.</p>
<p><img src="../../../../../images/log-normal.png" alt="log-normal" /></p>
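<p>For instance, a small numpy sketch (with synthetic data) of this transformation:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

np.random.seed(7)
# Synthetic skewed data, e.g. incomes or house prices
skewed = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)

# A log transform brings it back to a (roughly) normal shape
transformed = np.log(skewed)  # approximately Normal(0, 1)
</code></pre></figure>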
<p>Once it is transformed into a normal distribution, you are free to use the dataset with models that assume normally distributed input (as listed in the section above).</p>
<p>As a general rule, <b>always look at the probability distribution of your data</b> as a first step in data analysis.</p>
</description>
<pubDate>Tue, 21 Jul 2015 09:30:01 +0530</pubDate>
<link>http://rishy.github.io//stats/2015/07/21/normal-distributions/</link>
<guid isPermaLink="true">http://rishy.github.io//stats/2015/07/21/normal-distributions/</guid>
</item>
<item>
<title>Electricity Demand Analysis and Appliance Detection</title>
<description><p>In this post, we are going to analyze electricity consumption data from a house. We have a time-series dataset which contains the power (kW), the cost of electricity and the voltage at each time stamp. We are also provided with hourly temperature records for the same day. You can download the compressed dataset from <a href="https://github.com/rishy/electricity-demand-analysis/blob/master/data-science.gz">here</a>. I’d further recommend having a look at the corresponding <a href="https://github.com/rishy/electricity-demand-analysis/blob/master/Electricity%20Demand.ipynb">ipython notebook</a>.</p>
<p>The first part is the data analysis, where we do the basic data cleaning and analyze the power demand and the cost incurred. The second part employs a KMeans clustering approach to identify which appliance might be the major contributor to the power demand in a particular hour of the day.</p>
<p>So let’s start with the basic imports and reading the data from the given dataset.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16</pre></td><td class="code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="c"># Read the sensor dataset into pandas dataframe</span>
<span class="n">sensor_data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'merged-sensor-files.csv'</span><span class="p">,</span>
<span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s">"MTU"</span><span class="p">,</span> <span class="s">"Time"</span><span class="p">,</span> <span class="s">"Power"</span><span class="p">,</span>
<span class="s">"Cost"</span><span class="p">,</span> <span class="s">"Voltage"</span><span class="p">],</span> <span class="n">header</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>
<span class="c"># Read the weather data in pandas series object</span>
<span class="n">weather_data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_json</span><span class="p">(</span><span class="s">'weather.json'</span><span class="p">,</span> <span class="n">typ</span> <span class="o">=</span><span class="s">'series'</span><span class="p">)</span>
<span class="c"># A quick look at the datasets</span>
<span class="n">sensor_data</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<div style="max-height:1000px;max-width:1500px;overflow:auto;">
<table>
<thead>
<tr>
<th></th>
<th>MTU</th>
<th>Time</th>
<th>Power</th>
<th>Cost</th>
<th>Voltage</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>MTU1</td>
<td>05/11/2015 19:59:06</td>
<td>4.102</td>
<td>0.62</td>
<td>122.4</td>
</tr>
<tr>
<th>1</th>
<td>MTU1</td>
<td>05/11/2015 19:59:05</td>
<td>4.089</td>
<td>0.62</td>
<td>122.3</td>
</tr>
<tr>
<th>2</th>
<td>MTU1</td>
<td>05/11/2015 19:59:04</td>
<td>4.089</td>
<td>0.62</td>
<td>122.3</td>
</tr>
<tr>
<th>3</th>
<td>MTU1</td>
<td>05/11/2015 19:59:06</td>
<td>4.089</td>
<td>0.62</td>
<td>122.3</td>
</tr>
<tr>
<th>4</th>
<td>MTU1</td>
<td>05/11/2015 19:59:04</td>
<td>4.097</td>
<td>0.62</td>
<td>122.4</td>
</tr>
</tbody>
</table>
</div>
<p><br />
Let’s have a quick look at the weather dataset as well:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>weather_data
</code></pre>
</div>
<pre>
2015-05-12 00:00:00 75.4
2015-05-12 01:00:00 73.2
2015-05-12 02:00:00 72.1
2015-05-12 03:00:00 71.0
2015-05-12 04:00:00 70.7
.
.
dtype: float64
</pre>
<h2 id="task-1-data-analysis">TASK 1: Data Analysis</h2>
<h3 id="data-cleaningmunging">Data Cleaning/Munging:</h3>
<p>After having a look at <b>merged-sensor-files.csv</b>, I found that there are some inconsistent rows where the header names are repeated, and as a result ‘pandas’ converts all the columns to the ‘object’ type. This is quite a common problem that arises when merging multiple csv files into a single file.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>sensor_data.dtypes
</code></pre>
</div>
<pre>
MTU object
Time object
Power object
Cost object
Voltage object
dtype: object
</pre>
<p>Let’s find out and remove these inconsistent rows so that all the columns can be converted to appropriate data types.</p>
<p>The code below finds all the rows where the “Power” column holds the string value “ Power” (a repeated header) and gets their indexes.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3</pre></td><td class="code"><pre><span class="c"># Get the inconsistent rows indexes</span>
<span class="n">faulty_row_idx</span> <span class="o">=</span> <span class="n">sensor_data</span><span class="p">[</span><span class="n">sensor_data</span><span class="p">[</span><span class="s">"Power"</span><span class="p">]</span> <span class="o">==</span> <span class="s">" Power"</span><span class="p">]</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>
<span class="n">faulty_row_idx</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<pre>
[3784,
7582,
11385,
.
.
81617,
85327]
</pre>
<p>and now we can drop these rows from the dataframe</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5</pre></td><td class="code"><pre><span class="c"># Drop these rows from sensor_data dataframe</span>
<span class="n">sensor_data</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">faulty_row_idx</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c"># This should return an empty list now</span>
<span class="n">sensor_data</span><span class="p">[</span><span class="n">sensor_data</span><span class="p">[</span><span class="s">"Power"</span><span class="p">]</span> <span class="o">==</span> <span class="s">" Power"</span><span class="p">]</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<pre>
[]
</pre>
<p>We have cleaned up the sensor_data and now all the columns can be converted to more appropriate data types.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2