# <center>How to create a good EDA [thoughts]</center>

---

In such troubling times, with the influx of kernels exploring different aspects of the coronavirus (e.g. transmission, geospatial factors), it is more important than ever to create high-quality public kernels that share perspectives, insights and analyses. So, following in the footsteps of the illustrious legend JohnM, here are my own thoughts on why and how to deliver a solid, well-presented analytics report, not only on the coronavirus but on any topic. That includes competitions because, despite how it may seem, competitions are "real-world problems" too.

# 1. Structured analysis is always better

Let me ask you something. I'm going to present two hypothetical notebooks and ask which one you would prefer.

### Notebook 1

* Starts off with a detailed introduction that is informative yet to the point.
* Takes a first look at the data.
* Analyses individual variables first (univariate analysis), then continues in the same vein by analysing relationships between them (bivariate analysis).
* Follows through with a look at the outliers in the data and how they affect it.
* Creates features, looks at their correlations and then determines the best ones to use.
* Ends with a conclusion that gives a short and sweet summary of what was done in the notebook and where to go from there.

### Notebook 2

* Starts off by just making a few statements.
* Spends too much time inspecting the data and makes premature assumptions from even the smallest glimpse.
* Does not follow a structured form of analysis; instead, it mixes up univariate and bivariate analyses and draws convoluted conclusions from the data.
* Outliers are not even looked at.
* Features are generated, but most of their values are NaN and they are not correlated at all with the rest of the data.
* Conclusion is nonexistent.

In my opinion, it is obvious which one you would pick: the first one, because it has a **definite structure to it.**
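The order of steps in Notebook 1 can be sketched in code. This is only a minimal illustration, assuming a small pandas DataFrame with numeric columns; the function name `structured_eda`, the toy columns and the 1.5 × IQR outlier rule are my own assumptions, not a recipe from any particular kernel. It also ranks existing columns rather than engineering new features.

```python
import pandas as pd


def structured_eda(df: pd.DataFrame, target: str) -> dict:
    """Run the Notebook 1 steps in order and collect the results."""
    numeric = df.select_dtypes(include="number")
    report = {}

    # 1. First look at the data: shape and missing values per column.
    report["shape"] = df.shape
    report["missing"] = df.isna().sum()

    # 2. Univariate analysis: summary statistics for each numeric column.
    report["univariate"] = numeric.describe()

    # 3. Bivariate analysis: pairwise Pearson correlations.
    report["correlations"] = numeric.corr()

    # 4. Outliers: count values outside 1.5 * IQR for each numeric column
    #    (an assumed rule of thumb, chosen only for illustration).
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    report["outliers"] = outside.sum()

    # 5. Feature selection: rank columns by absolute correlation with the target.
    report["feature_ranking"] = (
        report["correlations"][target].drop(target).abs().sort_values(ascending=False)
    )
    return report


# Made-up toy data, purely for demonstration.
df = pd.DataFrame({
    "age": [22, 25, 27, 30, 31, 35, 38, 90],
    "income": [20, 24, 27, 31, 33, 36, 40, 95],
    "noise": [5, 1, 4, 2, 5, 1, 3, 2],
})
report = structured_eda(df, target="income")
print(report["outliers"])
print(report["feature_ranking"])
```

In a real notebook each numbered step would be its own section with plots and commentary; the point here is only that every step has a fixed place and feeds into the next one.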

Look at any of <a href="https://www.kaggle.com/headsortails">Heads or Tails</a>'s Rmarkdown reports. Those notebooks are works of art. A structured EDA is better because viewers and reviewers can more easily follow the analysis and draw their own inferences from the data (in general, inferring from the data is much easier when there is a **clear, discernible structure**).

Speaking of Heads or Tails, here are his thoughts (from his <a href="https://web.archive.org/web/20190709103039/http://blog.kaggle.com/2018/06/19/tales-from-my-first-year-inside-the-head-of-a-recent-kaggle-addict">interview with Kaggle</a>):

---

> My Kernels focus primarily on detailed EDA - ideally with a baseline prediction model derived from it. The full kernel takes me about a week, maybe two, depending on the complexity of the data set. Since I love exploring, I normally aim to make quick progress in the early days of the competition to have a comprehensive view of the data.

---

> I normally have the fundamental properties of the data set covered within a day or two, by which time I have also defined a roadmap on how to conduct the more detailed analysis of individual features. As this analysis progresses, other insights are likely to be revealed that merit a dedicated follow-up treatment. Learning new analysis tricks and methods takes up at least a couple of hours per kernel. There is always something new to learn which is great.

---

> I prefer to break my EDA into distinct but related parts such as the single- vs multi-parameter visualizations, correlation tests, or feature engineering. Those fundamental steps are similar from kernel to kernel, but their extent and importance can vary dramatically. This approach makes it easy to keep an overview of the big picture. Attempting to write an entire analysis kernel in one go can be a daunting task, especially for beginners, and I would advise against it.

---

As you can see, Heads or Tails always keeps his kernels in a structured order, which surely helped him become the first Kernels Grandmaster on Kaggle.

# 2. Try to deliver understandable plots

Let's take examples from Kaggle kernels, specifically, the NFL 1st and Future analytics competition.

First, from Jason Zivkovic's kernel:
<img src="https://www.kaggleusercontent.com/kf/22012031/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..heizFJ255E_T_pAv8NNyPQ.Ixf1t38HUh2w5e0oRZrswx6a683Uq2MlAL-UQibNbCiX3aSxbtWTIhXj5YN56gRPPf6sYhWsUUwI57UPdM63ftuxc_6XqGEF6epgetuMSAuQ3PBOzDRpYAYm-oNpBNeUJZNLcu_OdgnVmmDMevfs2DBlNrh3qmis8YJhHazD0wloniNudPGTLguKy4wG-p9403Uu7cbHYsSOhf_rG6SHWlDbnLqP50OBVSnYoMeWPwwU9vHcePIhib5_7k8Jh2DSPkK4dwQMrwT94ofX29o3tYLCWzS_cJK4XvVXHkCKM9LmbpP9plLqiwR2rQgi95oFA1bku2kEmvap4hjX4ucLDGKo8R10DdiHvxBYKATo6aSTR5pdBlZBQ2_I9_0LmixpQ2w2sthjy_zMOqjMtRuJiRXzeD4oGXy1z2zjhrJ3bkSAdS1Bdj6nymLH7a4Pno7QNk7u1yyk4oaqKvxMgu2ptXqgEYtOuBhswP8GVvO-RChM8r3SH2uQEodmbYfnF_IcbJMSyB7_ULWj_Umzy98r6F26y8CkzhqeSjYeAAw6i_h4VC7YSu_GE5sisQ64TPbxScyWAFyDQGBdsiy6zGu-W6eSekXIVzRo8M2ZonEWp_B2KrzG5gU0DEhdSXJND6hQi8uQDp0pKwosWsdBJtOtSHEkpIEMwp1JyRkTc-Px4GqCFNpiTQkyuSkTMyTr_ilS.isw0pJ4rTKQUog8UN91-Zw/__results___files/figure-html/unnamed-chunk-11-1.png"></img>

Now, from Ethan Schacht's kernel:
<img src="https://www.kaggleusercontent.com/kf/22280527/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..m09sWftjD0ZNSpAD4LkouQ.uM_jB0qQsKhVRBx1VCnVauPS7s7c8wIF8LKRykDZYNJDJ9fP77pz0WJVJ_A0gW3KxM5l7lmG9UQcqU6EOZK-3ySjvSIaS9qxxhsxZ1oySR8XMw55iLczHRI5TD4hnCah9Rww5yl6knuECYY9Z_q5KDPknBh8CdQUCQ20HmmAInOClVrp8v1jMbrdzReCo_ejdHlVYdOtcBxGBXmfaQZhYPCmfqxeGaQXqQ9Ps7wkvp2byicr_0bE_dTdLC1N7SobWLa_tHQnvfFrGWox8ZSOhv_huk4jhZjhqQ7oaZr3tOBL2oQZ_9eZKL9iTpS2um8Ay3lQRc-KrjkrvyK6Pae9GvcWf_pBFZ41uz-mmhuWja3ibMG5-1Edtg0n-0nzx5q_XVaw7aszyddr-Aqnae14DBl4Iktgys8uZHrONCbj820KZZ0OTvDNRVmpa2FrQ1gxO2qRPTr4XvSKkebKcR-UN7ivIUGBSEvNcVEYAEvGdlb_snirleoZZekIGNN1KyvJJz3CTUEdaUSMmUvl08fnaEkZZpjEmI3HefSnPhvbu7yI2X09y1NbwOJcLCxEFw0yoh1yZr4xP8HvoLUp2YHRJ1NUqn1Lnf3C9VuNaoXb8dbeGcdevw-dbBf1tvy9EUVCpdl3AyhPAQ0Ca3quo_BAch6lTb8azN55LmQSGNY4AOsMGUDvgOA6vOhK-THWi6gZ.WaG-TWu3NXKf0l2B4JIxag/__results___files/figure-html/unnamed-chunk-15-1.png"></img>

The point is that both of these visuals leave a mark while staying simple and easy to understand. I am a particularly big fan of Jason Zivkovic's kernels (but that's just me).

With an EDA, there's something I like to call the perfect balance: the balance between a plain bar chart and a detailed correlation + scatterplot visualization. The visuals should not be so simple that they become bland, and they should not be so complex that it is impossible to tell what the heck is going on. That metaphorical "perfect balance" is always present in sharp yet beautiful EDAs: simple but not too simple, detailed but not too in-depth. This balance is one reason why some EDA kernels have thousands of upvotes.

# 3. Choice of plots

The choice of plots is a very important factor in an EDA. It affects how easy it is to draw inferences from a plot and how visually pleasing the plot is to look at. Heads or Tails has spoken about this numerous times:

---

> Beyond that, I’m a big fan of data visualization and engaging narratives. The right plot says way more than the proverbial 1000 words. Stringing together a series of visuals to dive deep into the meaning behind the data is a feat that I will always be impressed with.

---

The right plot says more than 1000 words. So what is the right plot? Well, let's see. It's always kind of tricky to decide which plot to use in an EDA; you are spoilt for choice with `matplotlib`, `seaborn` and `plotly`.

Here is a simple example of when to use a bar plot. I recommend you think about what kind of plot to use based on the kind of data you work with.

* **Bar plot:** Use it whenever you want to show a variable, usually a categorical one (x-axis), against a count (y-axis), and the variable has some distinct feature that makes it interesting. A simple bar plot is OK, but a really decent bar plot must let you discern trends in the data instantly.
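As a rough illustration of a bar plot that makes a trend easy to spot, here is a minimal `matplotlib` sketch. The helper name `sorted_bar_plot` and the weather labels are made up for this example; sorting the bars by count is just one simple way (my assumption, not a rule) to make the ordering of categories obvious at a glance.

```python
from collections import Counter

import matplotlib.pyplot as plt


def sorted_bar_plot(values, title):
    """Draw a bar plot of category counts, tallest bar first."""
    # most_common() returns (category, count) pairs in descending count order.
    counts = Counter(values).most_common()
    labels = [label for label, _ in counts]
    heights = [count for _, count in counts]

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(labels, heights)
    ax.set_xlabel("Category")
    ax.set_ylabel("Count")
    ax.set_title(title)
    return ax


# Made-up categorical data, purely for demonstration.
weather = ["Rain", "Clear", "Clear", "Snow", "Clear", "Rain", "Clear"]
ax = sorted_bar_plot(weather, "Observations per weather type")
```

In a notebook the figure is displayed automatically after the cell runs; in a script you would call `plt.show()` or `ax.figure.savefig(...)` yourself.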
   
Hopefully, you can understand where and how to use plots by reading some of these guys' kernels: 

* **<a href="https://www.kaggle.com/jaseziv83">Jason Zivkovic</a>**
* **<a href="https://www.kaggle.com/ambarish">Bukun</a>**
* **<a href="https://www.kaggle.com/headsortails">Heads or Tails</a>**

# 4. A well-defined question that your EDA tries to solve

A pointless EDA that just wanders around looking at random variables that make zero sense is like trying to train a neural network without any features. A well-defined question is to the former what decent features are to the latter. The question gives your EDA a purpose: something to actually do instead of just "oh, pie chart here. oh, bar chart there. oh, maybe add something interactive! i'll get votes!".

For instance, look at the opening section of one of Jason Zivkovic's kernels:

---

> The game features a number of different playing modes, however Career mode as a manager holds the most appeal for me.
The following analysis will be tailored toward having the best chance at success in that mode for anyone interested.

---

> Some things I want to analyse in this paper:
>
> * Which features are highly correlated with a player’s overall rating by player position
> * Analyse the differences between a player’s current rating and their potential rating
> * Find out which teams have the highest potential
> * Find out the youngest teams / oldest teams
> * Use k-means clustering to try to find “bargains”; ie if there is someone with the same skills/potential, can they be found for a bargain?

# 5. The conclusion of your EDA

An EDA is basically something that drives you towards a conclusion which will aid your modeling. An EDA is **essential** if you want to model. So, your EDA **must** have a discernible conclusion that helps your modeling somehow. Think of a data science problem as buying a car. To buy a car, you have to get a good feel for it, and you have to understand what discernible advantage it has over another car. The conclusion you draw from that "feel and discernible advantage" must help you somehow in buying the car.

An EDA is a lot like buying a car. If you want a good car or a good model, you need to look around at what you have and examine features which might or might not prove advantageous. The guiding motto of EDA is:

> If you torture the data long enough, it will confess.

EDA is the torture and the closing remarks of your EDA are the confession.

# 6. The Best EDA Kernels

You can look at Shivam Bansal's [notebook](https://www.kaggle.com/shivamb/data-science-glossary-on-kaggle/notebook). It contains a comprehensive list of the best EDA kernels.