# Exploratory Data Analysis
![](https://fiverr-res.cloudinary.com/images/t_main1,q_auto,f_auto,q_auto,f_auto/gigs/136400130/original/4ba618b0ca533da4e275a05cc4013f6b982cb874/machine-learning-android-development.jpg)

# Exploratory Data Analysis

* The first step in any data science project
* Suggested in 1962 by John Tukey, an American mathematician well known for the Fast Fourier Transform algorithm, the box plot, the Tukey range test, the Tukey lambda distribution, the Tukey test of additivity, and the Teichmüller–Tukey lemma.
* Tukey presented simple plots (e.g. box plots, scatterplots) that, along with summary statistics (mean, median, quantiles, etc.), help paint a picture of a dataset.
![](https://upload.wikimedia.org/wikipedia/en/e/e9/John_Tukey.jpg)

# Elements of Structured Data

* **Numeric**: Data that are expressed on a numeric scale
    * **Continuous**: Data that can take on any value in an interval.
        * Synonyms: interval, float, numeric.
        * Example: wind speed or time duration
    * **Discrete**: Data that can take on only integer values, such as counts. 
        * Synonyms: integer, count
        * Example: count of the occurrence of an event
* **Categorical**: Data that can take on only a specific set of values representing a set of possible categories.
    * Synonyms: enums, enumerated, factors, nominal, polychotomous
    * **Binary**: A special case of categorical data with just two categories of values (0/1, true/false). 
        * Synonyms: dichotomous, logical, indicator, boolean
        * Example: 0/1, yes/no, or true/false
    * **Ordinal**: Categorical data that has an explicit ordering.
        * Synonyms: ordered factor.
        * Example: numerical rating (1, 2, 3, 4, or 5).
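
The taxonomy above can be sketched as a rough heuristic in Python. The function name `infer_kind` and the sample columns are hypothetical, and the rules are only an assumption about how one might guess a column's type. Ordinal data cannot be detected this way, because the ordering is a modelling decision rather than a property of the raw values.

```python
def infer_kind(values):
    """Heuristically label a column as binary, discrete, continuous, or categorical."""
    distinct = set(values)
    if len(distinct) <= 2:
        return "binary"          # two categories, e.g. 0/1 or yes/no
    if all(isinstance(v, int) for v in values):
        return "discrete"        # integer values such as counts
    if all(isinstance(v, (int, float)) for v in values):
        return "continuous"      # any value in an interval
    return "categorical"         # a fixed set of labels


# Hypothetical example columns.
print(infer_kind([0, 1, 1, 0]))              # clicked / not clicked
print(infer_kind([3, 0, 7, 2]))              # number of events
print(infer_kind([3.2, 0.5, 7.1]))           # wind speed
print(infer_kind(["red", "green", "blue"]))  # colour
```

A rating such as 1 to 5 would be labelled "discrete" here; treating it as ordinal is up to the analyst.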

### Even unstructured data need to be transformed into a structured format before any analysis

# Estimates of Location - Mode, Mean, Median, Range
![](https://www.theschoolrun.com/sites/theschoolrun.com/files/article_images/mode_mean_median_range.png)
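
A minimal sketch of the four estimates using Python's standard `statistics` module. The sample `values` is made up for illustration.

```python
import statistics

# Hypothetical sample: daily counts of some event.
values = [2, 3, 3, 5, 7, 8, 21]

mean = statistics.mean(values)            # arithmetic average
median = statistics.median(values)        # middle value of the sorted data
mode = statistics.mode(values)            # most frequent value
value_range = max(values) - min(values)   # distance between the extremes

print(mean, median, mode, value_range)
```

The single large value pulls the mean above the median, which is why the median is often preferred for skewed data.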

# Estimates of Variability - Standard Deviation
![](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Standard_deviation_diagram.svg/1200px-Standard_deviation_diagram.svg.png)

# Standard deviation is sensitive to outliers
![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRJRUzmCWzCoTx9X8l3urqMFE-Nxw_BvNGvVWEy8kqPdZMkpQMl&s)
# A robust estimate of variability - median absolute deviation from the median
![](https://www.asprova.jp/mrp/glossary/en/fig/mrp_173-2.jpg)
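
To make the contrast concrete, here is a small sketch that compares the standard deviation with the median absolute deviation on a sample with and without one extreme value. The helper `median_abs_deviation` and both samples are hypothetical.

```python
import statistics


def median_abs_deviation(data):
    """Median of the absolute deviations from the median."""
    center = statistics.median(data)
    return statistics.median([abs(x - center) for x in data])


# Hypothetical samples: identical except for the last value.
clean = [10, 11, 12, 13, 14]
with_outlier = [10, 11, 12, 13, 100]

print("stdev:", statistics.stdev(clean), statistics.stdev(with_outlier))
print("MAD:  ", median_abs_deviation(clean), median_abs_deviation(with_outlier))
```

The standard deviation grows many times over when the outlier is added, while the median absolute deviation stays the same.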

# Outliers
![](https://mathworld.wolfram.com/images/eps-gif/OutlierHistogram_1000.gif)
![](https://cdn.kastatic.org/googleusercontent/8bSRVB7q_zWxFliXcZVQSBDtip3sMGRkkHGLVzvflS3goQZZhmhrSD9u1cSduXh-9DJ9sSjCqVyozwQ_FwJNkptC)

![](https://slideplayer.com/slide/12859452/78/images/57/Outlier+detection+with+z-score.jpg)
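
Two common rules for flagging outliers can be sketched as follows: the z-score rule and Tukey's 1.5 × IQR fences. The thresholds (3 and 1.5) are conventional defaults rather than fixed requirements, and the function names and data are hypothetical.

```python
import statistics


def zscore_outliers(data, threshold=3.0):
    """Values more than `threshold` standard deviations away from the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) / sd > threshold]


def iqr_outliers(data, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


# Hypothetical sample: 19 typical values and one extreme value.
data = [10] * 4 + [11] * 5 + [12] * 6 + [13] * 4 + [50]

print(zscore_outliers(data))
print(iqr_outliers(data))
```

The z-score rule needs a reasonably large sample, because the outlier itself inflates the mean and the standard deviation used to judge it.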

# Exploring the Data Distribution - Boxplot
* A plot introduced by Tukey as a quick way to visualize the distribution of data.
![](https://miro.medium.com/max/18000/1*2c21SkzJMf3frPXPAR_gZA.png)
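
A box plot is drawn from the five-number summary. The sketch below computes it with `statistics.quantiles`; the helper name and sample are hypothetical, and quartile conventions differ slightly between tools (the "inclusive" method is assumed here).

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


# Hypothetical sample.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(sample))
```

The box spans Q1 to Q3 with a line at the median; the whiskers usually extend to the most extreme points that are not flagged as outliers.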

# Exploring the Data Distribution - Frequency table
![](https://www.s-cool.co.uk/assets/learn_its/gcse/maths/representing-data/graphs-and-charts-2/2007-10-23_114052.gif)
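
A frequency table for numeric data can be built by assigning each value to an equal-width bin and counting. This is a minimal sketch; `frequency_table`, the bin width, and the `ages` sample are assumptions made for illustration.

```python
from collections import Counter


def frequency_table(data, bin_width):
    """Count values per equal-width bin; returns {(low, high): count}."""
    counts = Counter((x // bin_width) * bin_width for x in data)
    return {(low, low + bin_width): counts[low] for low in sorted(counts)}


# Hypothetical sample.
ages = [3, 7, 12, 15, 18, 21, 22, 29, 34, 38]
table = frequency_table(ages, bin_width=10)

for (low, high), count in table.items():
    print(f"[{low}, {high}): {count}")
```

Bins that contain no values are omitted in this sketch. A histogram plots the same counts as bars.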

# Exploring the Data Distribution - Histogram
![](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Histogram_of_arrivals_per_minute.svg/1200px-Histogram_of_arrivals_per_minute.svg.png)

# Exploring the Data Distribution
## Density plot
* A smoothed version of the histogram, often based on a kernel density estimate.
![](https://datavizcatalogue.com/methods/images/top_images/SVG/density_plot.svg)
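
As a rough illustration of a kernel density estimate, the sketch below places a Gaussian kernel on every data point and averages them. The `bandwidth` value and the sample are arbitrary assumptions; plotting libraries usually choose the bandwidth automatically.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at x using Gaussian kernels."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


# Hypothetical sample.
sample = [1.0, 1.5, 2.0, 2.2, 3.0, 5.0]
density = gaussian_kde(sample, bandwidth=0.5)

# Riemann sum over [-5, 10); a valid density should integrate to about 1.
step = 0.01
area = sum(density(-5 + i * step) * step for i in range(1500))
print(round(area, 3), density(2.0), density(4.0))
```

A smaller bandwidth follows the data more closely, and a larger one gives a smoother curve.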

# Exploring the Data Distribution
## Density plot - Skewness & Kurtosis
![](https://www.researchgate.net/profile/Attila_Bonyar/publication/298415862/figure/fig1/AS:340236723867648@1458130164255/Illustration-of-the-skewness-and-kurtosis-values-and-how-they-correlate-with-the-shape-of.png)
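
Skewness and kurtosis can be computed from the central moments of the data. The sketch below uses the simple population (biased) moment formulas; statistical packages often apply a sample-size correction, so their values may differ slightly. Function names and samples are hypothetical.

```python
def central_moment(data, order):
    """Average of (x - mean) ** order."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)


def skewness(data):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Fourth standardized moment minus 3 (a normal distribution has 0)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


# Hypothetical samples.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```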

# Exploring Binary and Categorical Data
## Mode
* The most commonly occurring category or value in a data set.
![](https://statistics.laerd.com/statistical-guides/img/mode-1a.png)

# Exploring Binary and Categorical Data
## Bar charts
* The frequency or proportion for each category plotted as bars.
![](https://chartio.com/assets/f7272d/tutorials/charts/bar-charts/4231f2343fb3c86edcd9e468db398d33c18ab1c16c6d38a1afedf0e06ca8fc4d/bar-chart-example-3.png)
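
The mode and the frequencies behind a bar chart can both be obtained with `collections.Counter`. The `colors` sample is hypothetical, and the text bars are only a stand-in for a real plot.

```python
from collections import Counter

# Hypothetical categorical sample.
colors = ["red", "blue", "red", "green", "blue", "red"]

counts = Counter(colors)
mode_category, mode_count = counts.most_common(1)[0]
proportions = {category: count / len(colors) for category, count in counts.items()}

print("mode:", mode_category)
for category, count in counts.most_common():
    # One '#' per observation, followed by the proportion.
    print(f"{category:<6} {'#' * count} {proportions[category]:.0%}")
```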

# Exploring Binary and Categorical Data - Pie charts
![](https://upload.wikimedia.org/wikipedia/commons/6/63/Pie-chart.jpg)

# Relation of 2 continuous data
## Correlation
![](https://www.simplypsychology.org/correlation.jpg)

# Relation of 2 continuous data - Correlation matrix
![](https://miro.medium.com/max/666/1*EuqHC0-iQpNW6yNMJdpbnA.png)
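
A minimal sketch of the Pearson correlation coefficient and of a correlation matrix built from it. The column names and values are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson r for a dict of equally long numeric columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical measurements.
columns = {
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 55, 61, 70, 72],
    "hours_gaming": [9, 7, 6, 4, 2],
}
matrix = correlation_matrix(columns)

for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little or no linear relationship.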

# Relation of 2 continuous data
## Scatter Plot
![](https://www.researchgate.net/profile/Muhammad_Irwanto2/publication/271635547/figure/fig7/AS:668701388992522@1536442243891/The-scatter-plot-of-solar-radiation-versus-air-temperature-A-simple-linear-regression.png)

# Relation of 2 continuous data
## Hexagonal binning plot
![](https://seaborn.pydata.org/_images/hexbin_marginals.png)

# Relation of 2 continuous data
## Contour Plot
![](https://i.stack.imgur.com/sOQwJ.png)

# Relation of Two Categorical Variables
## Contingency Table
![](https://slideplayer.com/slide/1650553/7/images/22/Contingency+Table+Example.jpg)
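
A contingency table counts how often each pair of categories occurs together. The sketch below builds one as a nested dictionary; the helper name and the two variables are hypothetical.

```python
from collections import Counter


def contingency_table(rows, cols):
    """Nested dict of joint counts: table[row_level][col_level]."""
    counts = Counter(zip(rows, cols))
    row_levels = sorted(set(rows))
    col_levels = sorted(set(cols))
    return {r: {c: counts[(r, c)] for c in col_levels} for r in row_levels}


# Hypothetical paired observations.
smoker = ["yes", "no", "no", "yes", "no", "no"]
exercise = ["low", "high", "low", "low", "high", "high"]

table = contingency_table(smoker, exercise)
for level, row in table.items():
    print(level, row)
```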

# Relation between Categorical and Numeric Data
## Box Plot
![](https://www.wellbeingatschool.org.nz/sites/default/files/W@S-different-boxplots.png)

# Relation between Categorical and Numeric Data
## Violin Plot
![](https://seaborn.pydata.org/_images/seaborn-violinplot-2.png)

# Relation between Categorical and Numeric Data
## Multiple Density Plots
![](https://i.stack.imgur.com/vyznx.png)

# Relation between Categorical and Numeric Data - Ridge Plot
![](https://seaborn.pydata.org/_images/kde_ridgeplot.png)

# Relation between Categorical and Numeric Data - When categorical data has too many values
![](https://semiotic.nteract.io/assets/img/ridgeline.png)

# Relation between 2 Categorical Data and 1 Numeric Data
## Violin Plot
![](https://seaborn.pydata.org/_images/seaborn-violinplot-3.png)

# Relation between 1 Categorical Data and 2 Numeric Data
## Multiple Density Plots
![](https://i0.wp.com/www.sthda.com/sthda/RDoc/figure/ggpubr/arrange-multiple-ggplots-marginal-plot-grouped-data-1.png?w=450)

## Change of Numeric Data Over Time (also split by the values of categorical data)
![](https://datavizf18.classes.andrewheiss.com/class/06-class_files/figure-html/ridge-temp-month-gradient-1.png)

![](https://www.eteacherlk.com/wp-content/uploads/2018/04/maxresdefault.jpg)

# Does a difference in categorical data values lead to a difference in numeric data distribution?
## ANOVA - Analysis of Variance
![](https://www.questionpro.com/blog/wp-content/uploads/2016/03/rsz_anova-800x444.jpg)

# Does a difference in categorical data values lead to a difference in numeric data distribution?
## ANOVA - Analysis of Variance
![](https://i.pinimg.com/originals/eb/94/f9/eb94f9bae12be2d6617549bd22e7d216.jpg)
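
The F statistic of a one-way ANOVA compares the variance between group means with the variance within groups. The sketch below computes only the statistic; turning it into a p-value requires the F distribution, which a statistics package provides. The function name and the groups are hypothetical.

```python
def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)                              # number of groups
    n = sum(len(g) for g in groups)              # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical numeric values for three categories.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
print(f_stat)
```

A large F suggests that at least one group mean differs from the others.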

# Does change in 1 numeric data lead to change in another numeric data?
## Linear Regression
![](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/1200px-Linear_regression.svg.png)
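
A simple linear regression fits the line that minimizes the squared vertical distances to the points. The sketch below uses the closed-form least-squares solution; `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points lying exactly on y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```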

# Does change in 1 numeric data lead to change in another numeric data?
## Polynomial Regression
![](https://animoidin.files.wordpress.com/2018/07/polim_vs_linear.jpg?w=909&h=584&crop=1)

# Does change in 1 categorical data lead to change in another categorical data?
## Chi-Square Test
![](https://cdn.analystprep.com/cfa-level-1-exam/wp-content/uploads/2019/08/05104731/page-189.jpg)

# Does change in 1 categorical data lead to change in another categorical data?
## Chi-Square Test
![](https://image.slideserve.com/768344/chi-square-test-l.jpg)

# Does change in 1 categorical data lead to change in another categorical data?
## Chi-Square Test
![](https://i.pinimg.com/originals/9b/b7/b8/9bb7b8a0861827ffca359d96fe83a557.png)
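
The chi-square statistic compares the observed counts in a contingency table with the counts expected if the two variables were independent. A minimal sketch follows; the function name and both tables are hypothetical.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical 2x2 tables of observed counts.
dependent = [[10, 20], [20, 10]]
independent = [[10, 20], [20, 40]]

print(chi_square_statistic(dependent))
print(chi_square_statistic(independent))
```

With 1 degree of freedom, the critical value at the 5% significance level is about 3.84, so the first table would lead to rejecting independence and the second would not.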

# Feature Engineering
![](https://miro.medium.com/max/3154/1*DoNn5kB0I1BTEjhO2D3yOA.png)

# Feature Engineering
![](https://www.topbots.com/wp-content/uploads/2019/09/cover_1600px_web-1280x640.jpg)

# Feature Engineering
![](http://lambda-xmu.club/img/Machine_Learning_Workflow.png)

# Feature Engineering
![](https://machinelearningcoban.com/assets/FeatureEngineering/ML_models.png)

# Feature Engineering
![](https://miro.medium.com/max/9706/1*bNwd6mp1uzyx45uHnm1VWA.jpeg)

# Unsupervised Learning & Clustering
![](https://quantdare.com/wp-content/uploads/2018/03/plot.png)