**Hi everyone**

**What are we going to do in this notebook?**

* Dataset overview

* Check for missing observations

* Data Visualization for correlation

* Normality and correlation tests




**Thanks for reading. Please don't forget to upvote ;) Let's start.**

Note: This is not a detailed analysis.

In [None]:
import pandas as pd
import seaborn as sns  
import numpy as np
import warnings
warnings.filterwarnings("ignore")
from matplotlib import pyplot as plt  
from scipy.stats import shapiro     ## Normality Test
from scipy import stats    ## Correlation Tests (spearmanr, kendalltau)

In [None]:
dff = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
df = dff.copy()

In [None]:
df.head()

In [None]:
print(f"In DataFrame: {df.shape[0]} obs, and {df.shape[1]} features")

In [None]:
df.isnull().sum().sort_values(ascending = False)[0:20]

In [None]:
df.describe().T

In [None]:
# Keep only the numeric columns and drop the Id column, which carries no information
numericdf = df.select_dtypes(exclude = ["object"]).drop(columns = ["Id"])

In [None]:
numericdf.info()

In [None]:
numericdf.isna().sum()

In [None]:
# Fill missing values with the column median; assigning back avoids chained inplace fills
numericdf["LotFrontage"] = numericdf["LotFrontage"].fillna(numericdf["LotFrontage"].median())
numericdf["MasVnrArea"] = numericdf["MasVnrArea"].fillna(numericdf["MasVnrArea"].median())
numericdf["GarageYrBlt"] = numericdf["GarageYrBlt"].fillna(numericdf["GarageYrBlt"].median())
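The median fills above can also be written once for every column that has missing values. This is only a minimal sketch on a made-up toy frame (the `toy` DataFrame and its column names are assumptions for illustration, not part of the house prices data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two of the three columns
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 100.0],
    "b": [np.nan, 2.0, 4.0, 6.0],
    "c": [1, 2, 3, 4],
})

# Columns that contain at least one missing value
cols_with_na = toy.columns[toy.isna().any()]

# fillna with a Series of medians fills each column with its own median
toy[cols_with_na] = toy[cols_with_na].fillna(toy[cols_with_na].median())

print(toy)
```

The median is less sensitive to outliers than the mean, which is why the extreme value 100 in column `a` does not pull the filled value upward.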

In [None]:
print("Skewness: %f" % numericdf["SalePrice"].skew())
print("Kurtosis: %f" % numericdf["SalePrice"].kurt())

### Visualization

We will look at the distribution of the target variable and its correlation with the other variables.

In [None]:
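# distplot is deprecated in recent seaborn versions; histplot with kde=True draws a comparable plot
sns.histplot(numericdf["SalePrice"], color = "c", bins = 100, alpha = 0.4, kde = True, stat = "density");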

In [None]:
fg = plt.figure(figsize=(22,22))
for index in range(len(numericdf.columns)):
    plt.subplot(10 ,5 ,index + 1)
    sns.scatterplot(x = numericdf.iloc[:,index], y = "SalePrice", data = numericdf)
fg.tight_layout(pad = 1.0)

In [None]:
figure, ax = plt.subplots(1,6, figsize = (32,8))
sns.violinplot(data = numericdf, x = "OverallQual", y="SalePrice", ax = ax[0])
sns.violinplot(data = numericdf, x = "OverallCond", y="SalePrice", ax = ax[1])
sns.violinplot(data = numericdf, x = "GarageCars", y="SalePrice", ax = ax[2])
sns.violinplot(data = numericdf, x = "Fireplaces", y="SalePrice", ax = ax[3])
sns.violinplot(data = numericdf, x = "YrSold", y="SalePrice", ax = ax[4])
sns.violinplot(data = numericdf, x = "MoSold", y="SalePrice", ax = ax[5])
plt.show()

In [None]:
numericdf.corr()["SalePrice"].nlargest(20)

Variable with the **highest positive** correlation: OverallQual

Let's look at the distribution of SalePrice for each OverallQual level.

In [None]:
(sns.FacetGrid(numericdf,
              hue = "OverallQual",
              height = 8,
              xlim = (0, numericdf["SalePrice"].max()))
.map(sns.kdeplot, "SalePrice", fill = True)
.add_legend());

The correlation heatmap is shown below.

In [None]:
plt.figure(figsize = (14,8))
sns.heatmap(numericdf.corr(),
            cmap = "RdPu" ,
            annot = False ,
            linewidths = 1 ,
            robust = True);

## Normality Test
The main objective of a normality test is to check whether the data follow a Gaussian distribution.

**Shapiro-Wilk Test** :

Tests whether a data sample has a Gaussian distribution.

**Assumption** : Observations in each sample are independent and distributed identically.

**Hypothesis** :

H0: the sample has a Gaussian distribution.

H1: the sample does not have a Gaussian distribution.

In [None]:
for i in numericdf :
    test_statistics, pvalue = shapiro(numericdf[i])
    print(f"Shapiro Test Statistics for {i}: = {test_statistics:.4f}, P-value = { pvalue:.5f}")

**Decision** : Since all p-values are < 0.05, we reject H0 for every variable: none of them follows a Gaussian distribution.
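To make the decision rule concrete, here is a minimal sketch on synthetic data. The samples, the seed, and the `shapiro_decision` helper are assumptions made for illustration; only `scipy.stats.shapiro` comes from the notebook.

```python
import numpy as np
from scipy.stats import shapiro

ALPHA = 0.05  # significance level used throughout this notebook

rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)       # drawn from a Gaussian
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=500)   # strongly right-skewed

def shapiro_decision(sample, alpha=ALPHA):
    """Return (statistic, p-value, reject_h0) for the Shapiro-Wilk test."""
    statistic, pvalue = shapiro(sample)
    return statistic, pvalue, pvalue < alpha

for name, sample in [("normal", normal_sample), ("lognormal", skewed_sample)]:
    statistic, pvalue, reject = shapiro_decision(sample)
    verdict = "reject H0 (not Gaussian)" if reject else "fail to reject H0"
    print(f"{name}: W = {statistic:.4f}, p-value = {pvalue:.5f} -> {verdict}")
```

The right-skewed sample should be rejected clearly, which is the same pattern SalePrice shows. A truly Gaussian sample is usually not rejected, although by construction about 5% of such samples will be at this significance level.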

## Correlation Tests
Correlation tests are used to check whether two features or variables are related to each other.

**Spearman’s Rank Correlation** :
Tests whether two samples have a monotonic relationship.

**Assumption** : 1- Observations in each sample are independent and identically distributed. 2- Observations in each sample can be ranked.

**Hypothesis** :

H0: the two samples are uncorrelated.

H1: there is a monotonic relationship between the samples.
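A small sketch of what "monotonic" means in practice, using made-up data (the arrays `x` and `y` are assumptions for illustration). Pearson correlation, the default of `DataFrame.corr()`, is added here only for contrast:

```python
import numpy as np
from scipy import stats

# y always increases with x, but the relationship is far from linear
x = np.arange(1, 11, dtype=float)
y = np.exp(x)

rho, rho_pvalue = stats.spearmanr(x, y)   # rank-based, captures any monotonic trend
r, r_pvalue = stats.pearsonr(x, y)        # measures linear association only

print(f"Spearman rho = {rho:.4f}, p-value = {rho_pvalue:.5f}")
print(f"Pearson r    = {r:.4f}, p-value = {r_pvalue:.5f}")
```

Because the ranks of `x` and `y` agree perfectly, Spearman's coefficient reaches 1, while Pearson's stays below it. This is why a rank-based test is a reasonable choice for variables that failed the normality test.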

In [None]:
for i in numericdf:
    test_statistics, pvalue = stats.spearmanr(numericdf[i],numericdf["SalePrice"])
    print(f"Spearman-Correlation Coefficient for {i}: = {test_statistics:.4f}, P-value = { pvalue:.5f}")

**Kendall’s Rank Correlation** :

Tests whether two samples have a monotonic relationship, based on the number of concordant and discordant pairs.

**Assumption** : 1- Observations in each sample are independent and identically distributed. 2- Observations in each sample can be ranked.

**Hypothesis** :

H0: the two samples are uncorrelated.

H1: there is a monotonic relationship between the samples.
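Kendall's tau can be computed by hand on a tiny example, which shows what the coefficient measures. This is a sketch on made-up values (`x` and `y` are assumptions for illustration) with no ties, checked against `scipy.stats.kendalltau`:

```python
from itertools import combinations
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]

# A pair of observations is concordant if x and y move in the same direction,
# and discordant if they move in opposite directions.
concordant = 0
discordant = 0
for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
    direction = (x2 - x1) * (y2 - y1)
    if direction > 0:
        concordant += 1
    elif direction < 0:
        discordant += 1

n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs

tau_scipy, pvalue = stats.kendalltau(x, y)
print(f"concordant = {concordant}, discordant = {discordant}")
print(f"manual tau = {tau_manual:.4f}, scipy tau = {tau_scipy:.4f}")
```

With 8 concordant and 2 discordant pairs out of 10, tau is (8 - 2) / 10 = 0.6. When the data contain ties, SciPy's default variant applies a correction, so the simple formula above would no longer match exactly.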

In [None]:
for i in numericdf:
    test_statistics, pvalue = stats.kendalltau(numericdf[i],numericdf["SalePrice"])
    print(f"KendallTau-Correlation Coefficient for {i}: = {test_statistics:.4f}, P-value = { pvalue:.5f}")

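**Decision** : 

A p-value < 0.05 is statistically significant: we reject H0 and conclude that the variable is correlated with SalePrice.

A p-value > 0.05 is not statistically significant: we fail to reject H0, so there is no evidence of a correlation with SalePrice.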
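The decision rule above can be applied to all features at once by collecting the results in a table. A minimal sketch, assuming a made-up `toy` frame and a hypothetical `correlation_table` helper (neither is part of the original notebook):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: one feature that drives the target and one that is pure noise
toy = pd.DataFrame({
    "related": np.arange(n, dtype=float),
    "unrelated": rng.normal(size=n),
})
toy["target"] = 3.0 * toy["related"] + rng.normal(scale=5.0, size=n)

def correlation_table(frame, target, alpha=0.05):
    """Spearman correlation of every column with the target, plus a significance flag."""
    rows = []
    for col in frame.columns.drop(target):
        rho, pvalue = stats.spearmanr(frame[col], frame[target])
        rows.append({
            "feature": col,
            "rho": rho,
            "pvalue": pvalue,
            "significant": pvalue < alpha,  # True means H0 (no correlation) is rejected
        })
    return pd.DataFrame(rows).set_index("feature")

result = correlation_table(toy, "target")
print(result)
```

On the real data the same idea would be `correlation_table(numericdf, "SalePrice")`, which gives a sortable table instead of one printed line per feature.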