# Data Science - House Price Prediction By Example 

## 1. Loading the libraries
  - It is good practice to consolidate the import statements in one place, so it is clear what needs to be available in order to run the notebook.
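
As a rough illustration, a single import cell at the top of the notebook could look like the sketch below. The exact set of libraries is an assumption based on the tools mentioned in these notes (seaborn would be imported in the same cell where it is used).

```python
# Assumed import cell; adjust the list to what the notebook really uses.
import numpy as np                # numerical arrays
import pandas as pd               # DataFrames
import matplotlib.pyplot as plt   # plotting
from scipy import stats           # statistical distributions and tests

# Printing the versions makes the environment reproducible for other readers.
print("numpy", np.__version__)
print("pandas", pd.__version__)
```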

## 2. Loading and reviewing the datasets
  - Several datasets are loaded including datasets for training and testing.
  - Calling `DataFrame#head` and `DataFrame#tail` on a dataset is encouraged to understand what it contains and whether there is unwanted data at the bottom of the DataFrame, such as notes.
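
The following sketch shows why looking at both ends of a dataset can pay off. The CSV content, including the trailing note row, is made up for illustration and is not taken from the notebook's files.

```python
import io

import pandas as pd

# Hypothetical CSV content; the last line is a free-text note, not data.
raw = io.StringIO(
    "Id,LotArea,SalePrice\n"
    "1,8450,208500\n"
    "2,9600,181500\n"
    "3,11250,223500\n"
    "Source: example export,,\n"
)
df = pd.read_csv(raw)

print(df.head(2))  # first rows: check column names and value formats
print(df.tail(2))  # last rows: reveals the trailing note row

# One possible cleanup: keep only rows whose Id parses as a number.
ids = pd.to_numeric(df["Id"], errors="coerce")
clean = df[ids.notna()].copy()
clean["Id"] = ids[ids.notna()].astype(int)
```

Whether such rows should be dropped or handled differently depends on the dataset, so the cleanup step is only one option.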

## 3. Finding `null`/`na` values
  - Columns may or may not contain empty values (`null`, `na`, `nan`).
  - As features (predictors/regressors) are required to predict a value, understanding the quality of the data is crucial for creating a model: depending on the model, missing values might lead to an (unwanted) reduction of the usable data.
  - Executing `DataFrame.isnull().sum().sort_values(ascending=False)` prints the features sorted by their number of `null` values in descending order.
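
A small sketch of the null count on a made-up DataFrame (the column names and values are placeholders, not the notebook's data):

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two of the three columns.
df = pd.DataFrame(
    {
        "LotFrontage": [65.0, np.nan, 68.0, np.nan],
        "Alley": [None, None, "Pave", None],
        "SalePrice": [208500, 181500, 223500, 140000],
    }
)

# Count missing values per column, columns with the most gaps first.
null_counts = df.isnull().sum().sort_values(ascending=False)
print(null_counts)

# The share of missing values is often easier to judge than the raw count.
null_share = null_counts / len(df)
print(null_share)
```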

## 4. Exploring the Dataset

### Pandas API

When talking to the Pandas API, the `DataFrame` datatype is usually what's used to interact with the data. Understanding DataFrames is required for even basic interaction with the data.

#### DataFrame#describe

The `describe` function gives a first overview of the data and can be called on a DataFrame or on one of its columns.

| Property   | Description |
|------------|--------------|
| count      | The number of non-null entries         |
| mean       | The mean value of the column           |
| std        | The standard deviation of the column   |
| min        | The minimum value of the column        |
| 25%        | The 25th percentile                    |
| 50%        | The 50th percentile (median)           |
| 75%        | The 75th percentile                    |
| max        | The maximum value of the column        |
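
The rows of the table can be reproduced with a tiny DataFrame. The values below are invented for illustration.

```python
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame(
    {
        "LotArea": [8450, 9600, 11250, 9550, 14260],
        "SalePrice": [208500, 181500, 223500, 140000, 250000],
    }
)

# One summary column per numeric input column.
summary = df.describe()
print(summary)

# Individual statistics can be read by row label and column name.
median_price = summary.loc["50%", "SalePrice"]
print(median_price)
```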

##### Standard Deviation

A low `std` means the data is clustered close to the mean, while a high `std` indicates a lot of variance in the data, with very small and very high values compared to the mean.
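
To make this concrete, the sketch below compares two invented series that share the same mean but differ in spread. Note that pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Two hypothetical series with an identical mean of 100.
tight = pd.Series([98, 99, 100, 101, 102])
wide = pd.Series([10, 50, 100, 150, 190])

print(tight.mean(), wide.mean())  # same center
print(tight.std(), wide.std())    # very different spread
```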

### Understanding The Domain - Inspecting the target

The `SalePrice` column is an interesting column as it is the "target" of our prediction. `describe` can be called on a DataFrame column, which returns a summary of the column just as it did for the whole DataFrame (/table). By inspecting the data, we can reason about the dataset's validity (which generally requires some knowledge of the domain we're operating in).
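
A minimal sketch of inspecting the target and turning domain knowledge into checks. The data is invented, and the two checks (positive prices, no missing prices) are assumptions about what "valid" could mean here.

```python
import pandas as pd

# Hypothetical training data; only the target column is shown.
train = pd.DataFrame({"SalePrice": [208500, 181500, 223500, 140000, 250000]})

# Summary of the single target column.
target = train["SalePrice"].describe()
print(target)

# Assumed sanity checks derived from the domain:
# a house cannot be sold for zero or less, and every row needs a price.
has_only_positive_prices = bool(target["min"] > 0)
has_no_missing_prices = bool(target["count"] == len(train))
print(has_only_positive_prices, has_no_missing_prices)
```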

### Explorative Data Analysis With Visualizations 

#### Matplotlib Histogram

By plotting the price against the number of houses, the distribution of houses across price ranges is visualized. In the given notebook, the loaded dataset has most houses in a price range from 100,000 USD to 300,000 USD.
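
The sketch below draws such a histogram on synthetic prices. The log-normal sample is only a stand-in for the real `SalePrice` column, and the bin count of 30 is an arbitrary choice.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, right-skewed prices (assumption, not the notebook's data).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

fig, ax = plt.subplots()
# hist returns the count per bin and the bin edges alongside the drawn bars.
counts, bin_edges, _ = ax.hist(prices, bins=30)
ax.set_xlabel("SalePrice (USD)")
ax.set_ylabel("Number of houses")
```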

### scipy.stats - Statistics

From `scipy.stats`, `norm` (the normal distribution) is imported and used. Its `fit` method is documented as:
```
fit(data, loc=0, scale=1)
    Parameter estimates for generic data.
```
`fit` returns estimates for the location and the scale; for the normal distribution these correspond to the mean and the standard deviation.

> The fit method of the distributions can be used to estimate the parameters of the distribution, and the test is repeated using probabilities of the estimated distribution.
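
A short sketch of `norm.fit` on synthetic data (the sample is a stand-in, not the notebook's prices). For the normal distribution, the fitted location matches the sample mean and the fitted scale matches the standard deviation computed with `ddof=0`. The fitted density can then be evaluated, for example to overlay it on a histogram.

```python
import numpy as np
from scipy.stats import norm

# Synthetic sample (assumption): roughly normal values around 180,000.
rng = np.random.default_rng(0)
data = rng.normal(loc=180_000, scale=80_000, size=1_000)

# Maximum-likelihood estimates of location and scale.
mu, sigma = norm.fit(data)
print(mu, data.mean())   # location vs. sample mean
print(sigma, data.std()) # scale vs. standard deviation (ddof=0)

# Density of the fitted normal distribution over the observed range.
xs = np.linspace(data.min(), data.max(), 200)
pdf = norm.pdf(xs, loc=mu, scale=sigma)
```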

##### Question
Why did we do this and render a normal distribution based on the calculated values? Was it done for comparison with the graph we rendered for `SalePrice`?

### Exploring Features

Categorical data can be understood, at least in terms of frequency, by simply using a bar chart. Other features can be explored and understood with other visualizations, e.g. scatter plots. This is done to reduce the data to the features necessary to make a valid prediction.
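
A minimal sketch of a frequency bar chart for one categorical feature. The column name and its values are placeholders chosen for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame(
    {"Neighborhood": ["NAmes", "CollgCr", "NAmes", "OldTown", "NAmes", "CollgCr"]}
)

# Frequency of each category, most frequent first.
freq = df["Neighborhood"].value_counts()
print(freq)

# One bar per category.
fig, ax = plt.subplots()
ax.bar(list(freq.index), freq.to_numpy())
ax.set_xlabel("Neighborhood")
ax.set_ylabel("Number of houses")
```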

### Finding Correlations Between Features

Features might be redundant if one feature influences a second feature. In addition, we're searching for features which are highly correlated with `SalePrice`. Plotting a heatmap shows some very dark spots, which indicate features that relate strongly to `SalePrice`.

By plotting correlation matrices with `k` variables, identifying independent variables becomes more reliable. The seaborn heatmap prints the correlation coefficient in the cell where two variables cross. By comparing the resulting values and applying domain knowledge, important and probably redundant features can be isolated or removed.
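
The selection of the `k` variables can be sketched with pandas alone. The data below is synthetic, the column names are placeholders, and `k = 3` is an arbitrary choice; the resulting matrix is what would be handed to a heatmap.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): price depends strongly on quality,
# weakly on living area, and not at all on the noise column.
rng = np.random.default_rng(0)
n = 200
qual = rng.integers(1, 11, size=n)
area = rng.normal(1500, 400, size=n)
df = pd.DataFrame(
    {
        "OverallQual": qual,
        "GrLivArea": area,
        "Noise": rng.normal(size=n),
        "SalePrice": 40_000 * qual + 100 * area + rng.normal(0, 20_000, size=n),
    }
)

# Full correlation matrix of all numeric columns.
corr = df.corr()

# The k columns most correlated with the target (the target itself is included).
k = 3
top_cols = corr["SalePrice"].abs().nlargest(k).index
top_corr = df[top_cols].corr()
print(top_corr.round(2))

# With seaborn available, the matrix could be drawn as an annotated heatmap:
# sns.heatmap(top_corr, annot=True)
```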

Seaborn can also draw a similar matrix that plots a graph in each cell, which also gives a good overview.

___Using this matrix, we're also looking for patterns between the features.___

## 5. Linear Regression

By applying linear regression to various independent variables, we can inspect the accuracy of each predictor independently.
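
A sketch of fitting one simple regression per predictor. It uses `scipy.stats.linregress` as a stand-in, which may differ from the library used in the notebook, and the data and coefficients are synthetic.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic predictors (assumption): one informative, one pure noise.
rng = np.random.default_rng(0)
n = 200
features = {
    "OverallQual": rng.integers(1, 11, size=n).astype(float),
    "Noise": rng.normal(size=n),
}
price = 40_000 * features["OverallQual"] - 90_000 + rng.normal(0, 30_000, size=n)

# Fit the target against each predictor on its own and record R-squared.
r_squared = {}
for name, x in features.items():
    fit = linregress(x, price)
    r_squared[name] = fit.rvalue ** 2
    print(f"{name}: slope={fit.slope:.0f}, intercept={fit.intercept:.0f}, "
          f"R^2={r_squared[name]:.3f}")
```

For a regression with a single predictor, R-squared equals the squared correlation coefficient, which is why `rvalue ** 2` is used here.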

### R-squared

R-squared describes how much of the variability of the target variable is explained by the model. The higher the value, the more likely it is that the predictor can be used to model a prediction for the target value.
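
R-squared can be computed from its common definition, one minus the ratio of the residual sum of squares to the total sum of squares. The helper name and the numbers below are illustrative.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for the given observations and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot


y = np.array([100.0, 150.0, 200.0, 250.0])

# Perfect predictions explain all of the variability.
perfect = r_squared(y, y)

# Always predicting the mean explains none of it.
baseline = r_squared(y, np.full_like(y, y.mean()))
print(perfect, baseline)
```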

### $\hat{\beta}_0$

> If we don't have any Overallquality (`OverallQual`) factor in the picture, we would expect Saleprice (y) to be -9.621 exponentially raised to power of 4. In other words it is very small.

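Does that mean that `OverallQual` is a major factor contributing to the overall magnitude of the target variable? Note that $\hat{\beta}_0$ is only the predicted `SalePrice` at `OverallQual = 0`. If the value was printed as `-9.621e+04`, that is scientific notation for about -96,210, which is a large negative number rather than a very small one.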
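
A small sketch of how the intercept enters a one-predictor prediction. The literal assumes the coefficient was printed as `-9.621e+04`, and the slope is a hypothetical value chosen only to make the example runnable.

```python
# Assumed printed form of the intercept, read as scientific notation.
intercept = float("-9.621e+04")
print(intercept)  # -96210.0

# Hypothetical slope; the real value comes from the fitted model.
slope = 45_000.0


def predict(overall_qual):
    """Simple linear model: intercept plus slope times the predictor."""
    return intercept + slope * overall_qual


# At OverallQual = 0 the prediction is exactly the intercept.
print(predict(0))
# Each additional quality point shifts the prediction by the slope.
print(predict(6) - predict(5))
```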


