# Introduction

Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.

Before making inferences from data it is essential to examine all your variables.
Why?

1). To listen to the data

2). To catch errors, anomalies

3). To see patterns in the data

4). To find violations of statistical assumptions

5). To generate hypotheses

Exploratory data analysis involves a number of processes or activities, including:

1). Generating and analyzing descriptive statistics for each of the features

2). Checking correlations

3). Checking outliers

4). Analyzing target variables

5). Finding errors and anomalies in the features

## Most Important Exploratory Data Analysis Questions

- How will you understand Target Variable Distribution, and why is this so important?
- How can you visually discover Correlation between features, and why is it so important?
- How can we identify Outliers in a given Feature Space?
- How can we find Distribution-Skewness in a given Feature Space?



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt


# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

#Setting Style for Plotting
plt.style.use('fivethirtyeight')

To start with, I imported the necessary libraries (for this example pandas, numpy, matplotlib and seaborn) and loaded the data set.

## Loading Data and Initial Exploration to Understand Data

In [None]:
df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
df.head()

Find the total number of rows and columns in the dataset using shape.

In [None]:
df.shape

It is also a good practice to know the columns and their corresponding data types, along with finding whether they contain null values or not.

In [None]:
df.info()
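
As a complement to info(), the null counts and data types can also be read off directly. The sketch below uses a small made-up frame (`demo` and its values are hypothetical, not the wine data) just to show the pattern; on the real data you would call the same methods on df.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the wine data
demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, np.nan, 10.5],
    "pH": [3.51, 3.20, 3.26, np.nan],
    "quality": [5, 5, 6, 7],
})

# Number of missing values per column
null_counts = demo.isnull().sum()
print(null_counts)

# Data type of every column
print(demo.dtypes)
```

A column with a non-zero count needs a decision (drop, impute or investigate) before any modelling.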

# Descriptive Statistics

Descriptive statistics can give you great insight into the shape of each attribute.

Often you can create more summaries than you have time to review. The describe() function on the Pandas DataFrame lists 8 statistical properties of each attribute:

a). Count

b). Mean

c). Standard Deviation

d). Minimum Value

e). 25th Percentile

f). 50th Percentile (Median)

g). 75th Percentile

h). Maximum Value

The describe() function in pandas is very handy for getting various summary statistics. It returns the count, mean, standard deviation, minimum and maximum values and the quartiles of the data.


There is a notably large difference between the 75th percentile and the maximum values of the predictors “residual sugar”, “free sulfur dioxide” and “total sulfur dioxide”.
This observation suggests that there are extreme values (outliers) in our data set.


In [None]:
df.describe()
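
To make the "large gap between the 75th percentile and the maximum" check less of an eyeball exercise, one option is to express the gap in units of the interquartile range. This is only a sketch on made-up numbers: the `demo` frame and the `max_gap_in_iqr` column name are hypothetical, and the same lines should work on df.describe() for the wine data.

```python
import pandas as pd

# Hypothetical values; "residual sugar" has one extreme entry
demo = pd.DataFrame({
    "residual sugar": [1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.6, 15.5],
    "pH": [3.1, 3.2, 3.2, 3.3, 3.3, 3.4, 3.4, 3.5],
})

summary = demo.describe().T  # one row per feature

# How far the maximum sits above the 75th percentile, in IQR units
iqr = summary["75%"] - summary["25%"]
summary["max_gap_in_iqr"] = (summary["max"] - summary["75%"]) / iqr
print(summary[["75%", "max", "max_gap_in_iqr"]])
```

Features whose gap is well above 1.5 are the ones worth a closer look in a box plot.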

# How will you understand Target Variable Distribution, and why is this so important?

A few key insights emerge just from looking at the dependent variable.

1. The target (dependent) variable is discrete and ordinal: it takes a small number of integer scores.
2. The “quality” score is graded on a scale from 0 (very bad) to 10 (excellent); in this data set the observed scores run from 3 to 8.
3. You can identify class imbalance, which can help you understand and hopefully fix classification errors at a later stage.

In [None]:
df.quality.unique()


value_counts() gives the number of observations for each quality score in descending order. “quality” has most values concentrated in the categories 5, 6 and 7.
Only a few observations fall in the categories 3 and 8.

In [None]:
df.quality.value_counts()

In [None]:
df['quality'].hist()
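
Raw counts are easier to compare as proportions. The following is a minimal sketch of how class imbalance could be quantified; the scores in `quality` are invented for illustration and are not the real counts.

```python
import pandas as pd

# Hypothetical quality scores, not the real counts
quality = pd.Series([5, 5, 5, 6, 6, 6, 6, 7, 7, 3], name="quality")

# Share of each class, sorted by score
proportions = quality.value_counts(normalize=True).sort_index()
print(proportions)

# Ratio between the most and the least frequent class
imbalance_ratio = proportions.max() / proportions.min()
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

A large ratio is a hint that accuracy alone will be a misleading metric for a classifier trained on this target.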

# How can you visually discover Correlation between features?

When using linear regression for modelling, it helps to remove strongly correlated predictors, because multicollinearity makes the coefficient estimates unstable. You can compute correlations with the pandas “.corr()” function and visualize the correlation matrix as a heatmap in seaborn.

With the viridis colormap used here, lighter (yellow) shades represent higher, more positive correlations, while darker (purple) shades represent lower, more negative ones.
If you set annot=True, the correlation coefficient of each pair of features is printed in its grid cell.

1). Here we can infer that “density” is positively correlated with “residual sugar”, whereas it is negatively correlated with “alcohol”.

2). “free sulfur dioxide” and “citric acid” have almost no correlation with “quality”.

Since the correlation is close to zero, we can infer that there is no linear relationship between these two predictors (“free sulfur dioxide” and “citric acid”) and “quality”. It is therefore reasonably safe to drop these features if you are applying a Linear Regression model to the dataset.

In [None]:
fig, ax = plt.subplots(figsize=(15,7))
# Draw on the axes created above; annot=True prints each coefficient in its cell
sns.heatmap(df.corr(), cmap='viridis', annot=True, ax=ax)
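
With many features, reading pairs off a heatmap gets tedious. One possible approach is to list the strongly correlated pairs programmatically. This is a sketch on synthetic data: the columns `a`, `b`, `c` and the 0.8 `threshold` are assumptions made for illustration, not values taken from the wine data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # almost a copy of "a"
    "c": rng.normal(size=200),                      # unrelated noise
})

corr = demo.corr().abs()

# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)

threshold = 0.8  # assumed cut-off, tune for your data
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

For each pair that shows up, you would typically keep the feature that is easier to interpret or more strongly related to the target.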

# How can we identify Outliers in a given Feature Space?

Skipping outlier detection is a commonly overlooked mistake. The temptation is to start building models right away on the data you’ve been given, but that essentially sets you up for failure.

Data exploration consists of many things, such as variable identification, treating missing values, feature engineering, etc. Detecting and treating outliers is also a major cog in the data exploration stage. The quality of your inputs decides the quality of your output!


In [None]:
l = df.columns.values
number_of_columns = len(l)  # one panel per column (12 for this data set)
number_of_rows = 1          # all panels fit in a single row
plt.figure(figsize=(2*number_of_columns, 5*number_of_rows))
sns.set_style('whitegrid')
for i in range(0, len(l)):
    # subplot() needs integer grid sizes; the original float row count fails
    plt.subplot(number_of_rows, number_of_columns, i+1)
    sns.boxplot(y=df[l[i]], color='green')  # y= draws a vertical box
plt.tight_layout()
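
The points drawn beyond the whiskers of a default box plot are those that fall outside 1.5 times the interquartile range from the quartiles. The same rule can be applied numerically, which is handy when you want a count rather than a picture. A minimal sketch, assuming a made-up feature (`values` is hypothetical):

```python
import pandas as pd

# Hypothetical feature values with two obvious extremes
values = pd.Series(
    [7.0, 7.2, 7.4, 7.5, 7.8, 8.0, 8.1, 8.3, 15.9, 0.5],
    name="fixed acidity",
)

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the 1.5 * IQR fences
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

Whether such points are errors or genuine extremes still has to be judged feature by feature.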

# How can we find Distribution-Skewness in a given Feature Space?

According to Wikipedia, “In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.”

In [None]:
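# One histogram with a KDE overlay per column, all in a single row
plt.figure(figsize=(2*number_of_columns, 5))
for i in range(0, len(l)):
    plt.subplot(1, number_of_columns, i+1)
    sns.histplot(df[l[i]], kde=True)  # distplot is deprecated in recent seaborn
plt.tight_layout()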

In [None]:
print("Skewness  \n ",df.skew())
print("\n Kurtosis  \n ", df.kurt())
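
To see what skew() actually computes, the sketch below rebuilds the number from central moments and compares it with pandas. It assumes a synthetic right-skewed sample (`s` is hypothetical); pandas reports the sample-size-adjusted coefficient rather than the plain moment ratio.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical right-skewed sample
s = pd.Series(rng.exponential(scale=2.0, size=500))

n = len(s)
dev = s - s.mean()
m2 = (dev ** 2).mean()  # second central moment
m3 = (dev ** 3).mean()  # third central moment
g1 = m3 / m2 ** 1.5     # plain moment-based skewness

# Sample-size adjustment, which matches what pandas reports
G1 = np.sqrt(n * (n - 1)) / (n - 2) * g1

print(f"manual: {G1:.4f}  pandas: {s.skew():.4f}")
```

A positive value means a longer right tail, a negative value a longer left tail. Note also that kurt() in pandas returns excess kurtosis, so a normal distribution scores about 0.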

# Skewness and Kurtosis on House Price Prediction


In [None]:
# Read and load Data
train = pd.read_csv("../input/housepricesadvancedregressiontechniquestrain/train.csv")
train.describe()

In [None]:
# Plot histogram with a KDE overlay for 'SalePrice'
sns.histplot(train['SalePrice'], kde=True)  # distplot is deprecated in recent seaborn

In [None]:
# Skewness and Kurtosis
print("Skewness : %f" % train['SalePrice'].skew())
print("Kurtosis : %f" % train['SalePrice'].kurt())

In [None]:
target = np.log(train.SalePrice)
print("Skewness : %f" % target.skew())
print("Kurtosis : %f" % target.kurt())
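
The effect of the log transform is easiest to see on data whose distribution is known. Here is a small sketch on synthetic "prices" drawn from a log-normal distribution; the `prices` series and its parameters are assumptions for illustration, not the house price data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical right-skewed "prices" drawn from a log-normal distribution
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.6, size=1000))

raw_skew = prices.skew()
log_skew = np.log(prices).skew()
print(f"raw: {raw_skew:.3f}  log: {log_skew:.3f}")
```

The log of a log-normal variable is normal, so the skewness after the transform should land close to zero, which is the behaviour we hope for with a right-skewed target such as a sale price.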

# Finding Outliers

The term was popularized by Malcolm Gladwell, who defines outliers as people who do not fit into our normal understanding of achievement. His book Outliers deals with exceptional people, especially those who are smart, rich and successful, and those who operate at the extreme outer edge of what is statistically plausible. In data analysis, an outlier is a data point that is distant from other similar points. Outliers may be due to variability in the measurement or may indicate experimental errors. Where they stem from errors, outliers should be excluded from the data set. Here we apply a simple threshold and keep only the houses whose 'GarageArea' is below 1200.

In [None]:
train = train[train['GarageArea'] < 1200]
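
A fixed threshold like the one above is chosen by eye. One common alternative that relies on the standard deviation is the z-score. The sketch below uses invented sale prices (`sale_price` is hypothetical), and the cut-off of 3 is an assumption rather than a rule.

```python
import pandas as pd

# Hypothetical sale prices; the last one is far above the rest
sale_price = pd.Series(
    [120000, 135000, 150000, 155000, 160000, 172000,
     180000, 195000, 210000, 225000, 240000, 755000],
    name="SalePrice",
)

# Distance from the mean in units of standard deviation
z = (sale_price - sale_price.mean()) / sale_price.std()

threshold = 3  # assumed cut-off; 2.5 or 3 are common choices
flagged = sale_price[z.abs() > threshold]
print(flagged)
```

Keep in mind that extreme values inflate the standard deviation themselves, so on small samples the z-score can miss outliers that the IQR rule would catch.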


In [None]:
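# Histogram with a fitted normal curve, then a normal probability plot
from scipy import stats
from scipy.stats import norm

sns.histplot(train['SalePrice'], stat='density', kde=True)
# Overlay the normal density fitted to the data (replaces distplot's fit=norm)
mu, sigma = norm.fit(train['SalePrice'])
x = np.linspace(train['SalePrice'].min(), train['SalePrice'].max(), 200)
plt.plot(x, norm.pdf(x, mu, sigma), color='black')

fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)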
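
Besides drawing the plot, probplot also returns a least-squares fit whose correlation coefficient r summarizes how closely the points follow the straight line. As a rough sketch on synthetic data (the `sample` array is hypothetical), r can be compared before and after a log transform:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical right-skewed sample standing in for a price column
sample = rng.lognormal(mean=12.0, sigma=0.5, size=500)

# probplot returns the plotting points and a least-squares fit;
# r is the correlation between the ordered data and the normal quantiles
(_, _), (_, _, r_raw) = stats.probplot(sample, dist="norm")
(_, _), (_, _, r_log) = stats.probplot(np.log(sample), dist="norm")

print(f"r (raw): {r_raw:.4f}  r (log): {r_log:.4f}")
```

An r closer to 1 means the points hug the reference line more tightly, that is, the data look more normal.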
