## Exploratory data analysis

Inspecting the data is an essential step in the data analysis process: it builds an understanding of the dataset's characteristics and helps identify inconsistencies or other issues.

Data can be inspected in several ways, such as:

* Examining the data types: All variables should be represented with an appropriate data type. Numerical variables should be stored as integers or floats, and categorical variables as strings or categoricals.

* Examining missing values: The dataset should be checked thoroughly for missing or null values. Missing values can cause problems when building machine learning models, so it is crucial to identify and handle them appropriately.

* Analysing the data distribution: Visualizing the distribution of variables with histograms or box plots helps identify outliers or skewed distributions that may require special handling or transformation.

* Examining the relationships between variables: Scatterplots or correlation analysis can be used to identify relationships between variables. This helps reveal potential multicollinearity or redundancy among variables.
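
The four checks above can be sketched with plain pandas. This is a minimal illustration on a small made-up DataFrame (the column names `area`, `rooms`, `zone` and `price` are hypothetical and not taken from the dataset), not the profiling workflow used later in this notebook. The 1.5 × IQR outlier rule is one common convention, chosen here as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "area": [50.0, 60.0, 70.0, 80.0, np.nan, 500.0],
    "rooms": [2, 2, 3, 3, 4, 10],
    "zone": ["A", "B", "A", None, "B", "A"],
    "price": [100.0, 120.0, 140.0, 160.0, 180.0, 1000.0],
})

# 1. Data types of each column.
dtypes = df.dtypes

# 2. Number of missing values per column.
missing_counts = df.isnull().sum()

# 3. Distribution: skewness of numeric columns and IQR-based outliers in "price".
numeric = df.select_dtypes("number")
skewness = numeric.skew()
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]

# 4. Relationships: pairwise Pearson correlation between numeric columns.
corr = numeric.corr()

print(dtypes, missing_counts, skewness, outliers, corr, sep="\n\n")
```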

This notebook is dedicated to data exploration: I will create data profiling reports for the train and test datasets and a comparison between them.

In [None]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport  # newer releases are published as ydata_profiling
import matplotlib.pyplot as plt

In [None]:
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

In [None]:
numerical_data = train.select_dtypes("number")
numerical_data.info()

In [None]:
categorical = train.select_dtypes(object)
categorical.info()

In [None]:
# Feature separation:
discrete = ['YearBuilt', 'YearRemodAdd','BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
            'BedroomAbvGr', 'KitchenAbvGr','TotRmsAbvGrd','Fireplaces', 'GarageYrBlt','GarageCars', 
            'MoSold', 'YrSold', 'OverallQual', 'OverallCond']

continuous = ['LotFrontage', 'LotArea','MasVnrArea','BsmtFinSF1',  'BsmtFinSF2', 'BsmtUnfSF', 
              'TotalBsmtSF','1stFlrSF', '2ndFlrSF', 'LowQualFinSF','GrLivArea', 'GarageArea',  
              'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch','ScreenPorch', 'PoolArea', 
              'MiscVal']

nominal = ['MSSubClass','MSZoning', 'Alley',  'LandContour','LotConfig',   'Neighborhood', 
           'Condition1', 'Condition2', 'BldgType', 'HouseStyle','RoofStyle','RoofMatl', 'Exterior1st',
           'Exterior2nd', 'MasVnrType',  'Foundation','Heating',  'CentralAir',  'GarageType','MiscFeature',
           'SaleType', 'SaleCondition']

ordinal = ['LotShape', 'LandSlope',  'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
           'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2','HeatingQC', 'Electrical','KitchenQual', 
           'Functional','FireplaceQu', 'GarageFinish', 'GarageQual','GarageCond','PavedDrive', 
           'PoolQC', 'Fence']

In [None]:
# Train report
train_profile = ProfileReport(
    train, 
    title="EDA report for training dataset",
    correlations={
        "phi_k": {"calculate": True},
    },
    )
train_profile.to_notebook_iframe()
train_profile.to_file("../reports/train_dataframe_eda.html")

In [None]:
# Test report
test_profile = ProfileReport(
    test,
    title="EDA report for test dataset",
    correlations={
        "phi_k": {"calculate": True},
    },
)
test_profile.to_notebook_iframe()
# test_profile.to_widgets()
test_profile.to_file("../reports/test_dataframe_eda.html")

In [None]:
# Comparison of test and train reports
comparison_report = train_profile.compare(test_profile)
comparison_report.to_file("../reports/comparison.html")

### Key conclusions:

* The target variable, sale price, has a right-skewed distribution: a small proportion of houses sell for much more than the majority. In the histogram the peak sits towards the left, with a long tail towards the right. The house prices are therefore not normally distributed, which can affect the results of statistical methods that assume normality. Skewness is common in real-world data. To reduce it, transformations such as the logarithm, square root, or reciprocal (or a more general power transform) can be applied to bring the distribution closer to normal. Some features also show a non-linear relationship with the target variable and may require transformation.

* The dataset is heterogeneous, containing both numeric and categorical data types. In total it comprises 79 features and 1460 samples. Some of the features have missing values.

* Some features appear to be highly correlated with one another, which raises concerns about multicollinearity. Multicollinearity occurs when two or more predictor variables are highly correlated, which can reduce model stability.

* Some features may have low variance, meaning their values barely vary across the dataset. Such features contribute little new information and are unlikely to improve predictive power.

* Many of the numerical features hold integer values, so the numerical features can be divided into continuous and discrete groups. Note that some features, such as OverallQual, are ordinal ratings that come pre-encoded as numbers.
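
As a rough illustration of the skewness, multicollinearity and low-variance points, the sketch below builds a synthetic DataFrame (not the actual housing data) and computes a simple diagnostic for each. The column names, the 0.9 correlation threshold and the 0.99 dominant-value threshold are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; nothing here comes from the real dataset.
rng = np.random.default_rng(0)
n = 1000
area = rng.normal(1500.0, 300.0, size=n)
df = pd.DataFrame({
    "price": rng.lognormal(mean=12.0, sigma=0.4, size=n),  # right-skewed target
    "area": area,
    "area_copy": area + rng.normal(0.0, 10.0, size=n),     # nearly duplicates "area"
    "constant_like": np.where(np.arange(n) < 5, 1, 0),      # almost always 0
})

# Skewness of the target before and after a log transform.
# log1p computes log(1 + x), which also handles zero values.
skew_before = df["price"].skew()
skew_after = np.log1p(df["price"]).skew()

# Feature pairs whose absolute Pearson correlation exceeds the threshold.
features = df.drop(columns="price")
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_corr_pairs = upper.stack()
high_corr_pairs = high_corr_pairs[high_corr_pairs > 0.9]

# Features dominated by a single value (share of the most frequent value).
top_share = features.apply(lambda col: col.value_counts(normalize=True).iloc[0])
low_variance = top_share[top_share > 0.99].index.tolist()

print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
print(high_corr_pairs)
print(low_variance)
```

The thresholds are a judgment call; the profiling reports above apply their own criteria when flagging correlated or near-constant columns.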


#### Detailed look at missing values

In [None]:
missing = train.isnull().sum()
missing = missing[missing > 0]
missing.plot.bar()
plt.xlabel("Features")
plt.ylabel("Number of missing values")
plt.title("Missing values")
plt.show()

In [None]:
numerical_data.loc[:,numerical_data.isnull().sum()>0].head(5)

In [None]:
train[nominal].loc[:,train[nominal].isnull().sum()>0].head(5)

In [None]:
train[ordinal].loc[:,train[ordinal].isnull().sum()>0].head(5)

* Most of the missing values in the dataset appear to reflect the absence of the corresponding feature in a given sample. This holds for numeric and nominal variables such as 'LotFrontage', 'MasVnrArea', 'GarageYrBlt', 'Alley', 'MasVnrType', 'GarageType', 'MiscFeature', as well as for ordinal variables such as 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'. However, the single missing value in 'Electrical' is unusual, and that sample requires further examination or exclusion.
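
One possible way to act on this observation is sketched below. It uses a hypothetical four-row miniature of the training data with made-up values. Filling categorical gaps with the string "None" and numeric gaps with 0 is an assumption that should be verified per column, and dropping the 'Electrical' row is only one of the two options mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training data; values are made up.
df = pd.DataFrame({
    "Alley": ["Grvl", np.nan, np.nan, "Pave"],
    "BsmtQual": ["Gd", "TA", np.nan, "Ex"],
    "MasVnrArea": [196.0, np.nan, 0.0, 162.0],
    "Electrical": ["SBrkr", "SBrkr", "FuseA", np.nan],
})

# Columns where NaN is assumed to mean "the house does not have this feature".
absent_categorical = ["Alley", "BsmtQual"]
absent_numeric = ["MasVnrArea"]

cleaned = df.copy()
cleaned[absent_categorical] = cleaned[absent_categorical].fillna("None")
cleaned[absent_numeric] = cleaned[absent_numeric].fillna(0)

# 'Electrical' is handled separately: here the incomplete row is dropped.
cleaned = cleaned.dropna(subset=["Electrical"])

print(cleaned)
```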