# Final Project Template

For the final project for this module, you are asked to use data analysis techniques and linear regression to create a model to predict housing prices. 

In Video 7.9, Dr. Williams presented you with an example of data analysis in which housing prices were predicted by using just the columns `OverallQual` and `MasVnrArea` from the data provided. In Video 7.10, Dr. Williams showed more examples of data visualization and manipulation in addition to a more detailed analysis of the data.

Your challenge in this project is to improve Dr. Williams' results from Video 7.9 by choosing different variables in the *dataframe* to create your model. Although in Video 7.10 you are offered a sample data analysis which uses five columns from the data provided, your project submission must include an analysis of at least three additional variables and offer other solutions that improve the results obtained by Dr. Williams in these two videos.

Before you fill out the project outline template below, make sure you:

- Read through the template completely to understand the instructions for the structure of the project.
- Have a clear understanding of what to do to create a model that will return the results you want to find.
- Use Markdown to edit the template.

<div class="alert alert-block alert-success">
The purpose of this Jupyter Notebook is to give you a structure to follow when you are solving your problem and developing your model with Python. Make sure you follow it carefully. You can add more subsections if needed, but remember to fill out every section provided in the template.
</div>

<div class="alert alert-block alert-danger">
Delete all cells above, including this one, before submitting your final Notebook.
</div>

# Prediction of Housing Prices

**Giovanni Bernal Heredia**

# Index

- [Abstract](#Abstract)
- [1. Introduction](#1.-Introduction)
- [2. The Data](#2.-The-Data)
    - [2.1 Import the Data](#2.1-Import-the-Data)
    - [2.2 Data Exploration](#2.2-Data-Exploration)
    - [2.3 Data Preparation](#2.3-Data-Preparation)
    - [2.4 Correlation](#2.4-Correlation)
- [3. Project Description](#3.-Project-Description)
    - [3.1 Linear Regression](#3.1-Linear-Regression)
    - [3.2 Analysis](#3.2-Analysis)
    - [3.3 Results](#3.3-Results)
    - [3.4 Verify Your Model Against Test Data](#3.4-Verify-Your-Model-Against-Test-Data)
- [Conclusion](#Conclusion)
- [References](#References)

[Back to top](#Index)


##  Abstract

This is a brief description (150 words or less) of your analysis and results of your prediction model. Complete this portion of the template after you are done working on your project.

[Back to top](#Index)


## 1. Introduction
This project analyzes housing data and predicts housing prices. The goal is to predict the sale price from multiple features (characteristics) of the houses sold, using a provided data set of house sales. To analyze the data and predict prices, we followed this procedure:
1. Read/Parse housing prices data file
2. Clean the data 
3. Analyze the data and identify the most relevant characteristics/features
4. Train a model
5. Predict housing prices for data outside the training data
6. Verify predictive capabilities with unseen data


[Back to top](#Index)

## 2. The Data



[Back to top](#Index)

### 2.1 Import the Data

To import the data, we use the **pandas.read_csv()** function from the **pandas** library.

The data provided contains multiple categorical and numerical features of houses sold. It has 100 observations (rows) and 80 features (columns), including the sale price of each house and features such as "Overall Quality" or "Garage Area." The data is not complete: some values are missing (NaN).

The first and last five rows for a subset of columns are shown below.

![data_description.png](attachment:data_description1.png)
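The import step can be sketched as follows. This is a minimal example, not the notebook's actual code: the name of the training file is not stated here, so the sketch reads a tiny in-memory CSV with made-up values through the same **pandas.read_csv()** call. In the notebook, the argument would be the path to the provided data file.

```python
import io

import pandas as pd

# Stand-in for the provided data file: three made-up rows, one missing value.
csv_text = """Id,OverallQual,GarageArea,LotFrontage,SalePrice
1,7,500,60.0,200000
2,6,450,,180000
3,8,600,70.0,250000
"""

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)   # (number of rows, number of columns)
print(data.head())  # first rows (up to 5)
print(data.tail())  # last rows (up to 5)
```

The empty `LotFrontage` field in the second row is parsed as NaN, which is how missing values appear after import.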

[Back to top](#Index)

### 2.2 Data Exploration

Statistics for the data were produced by using the **DataFrame.describe()** function. The table produced is shown below. It provides an idea of the ranges and distribution of the data.

![describe_statistics.png](attachment:describe_statistics.png)


The most important variable is the sale price. The following histogram shows the distribution of the sale prices, which does not follow a normal distribution. The mean is 173,820 and the standard deviation is 72,236.

![sale_price_hist.png](attachment:sale_price_hist.png)

The sale prices are skewed toward high values, which suggests that their logarithm may be closer to a normal distribution. We therefore plotted the distribution of the logarithm of the sale prices.

![hist_sale_price_log.png](attachment:hist_sale_price_log.png)


The log of the sale prices has a much smaller skewness (-0.1) than the untransformed sale prices (1.18). Skewness is a measure of the asymmetry of a distribution.
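The comparison of skewness before and after the log transform can be sketched as follows. The prices here are made-up values chosen to have a long right tail, not the project data, so the numbers printed will differ from those reported above.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices (not the project data).
prices = pd.Series(
    [90_000, 110_000, 120_000, 135_000, 150_000, 180_000, 240_000, 420_000],
    name="SalePrice",
)

# Natural logarithm of every price.
log_prices = np.log(prices)

# Series.skew() returns the sample skewness; 0 means a symmetric distribution.
skew_raw = prices.skew()
skew_log = log_prices.skew()

print(f"skewness of prices:      {skew_raw:.2f}")
print(f"skewness of log(prices): {skew_log:.2f}")
```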

The following scatter plots show the relationship between multiple features and the sales prices.

![OverallQual.png](attachment:OverallQual.png)

![MasVnrArea.png](attachment:MasVnrArea.png)

![GrLivArea.png](attachment:GrLivArea.png)




[Back to top](#Index)

### 2.3 Data Preparation

Only numerical values could be used in our prediction analysis, so we filtered the data by type and kept only the columns with numerical values. We then checked the remaining columns for NaN values by running **DataFrame.isnull().sum()**. The columns that contain NaN values are listed below with their number of missing values.

|Column     |Number of NaN|
|-----------|-------------|
|LotFrontage|           14|
|GarageYrBlt|            6|
|PoolQC     |          100|

For PoolQC, the whole column was dropped because all of its values were NaN. The missing values in the other columns were replaced by interpolation.

Determine if there are any missing values in the data. Did the data need to be reshaped? If yes, include a description of the steps you followed to clean the data.
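The preparation steps described in this section can be sketched as follows. The miniature DataFrame is hypothetical, and the use of **DataFrame.interpolate()** with linear interpolation is an assumption: the text above does not name the interpolation function that was actually used.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data: one text column, gaps in two numeric
# columns, and one column that is entirely NaN.
df = pd.DataFrame({
    "Neighborhood": ["A", "B", "A", "C", "B"],
    "LotFrontage": [60.0, np.nan, 70.0, np.nan, 90.0],
    "GarageYrBlt": [1990.0, 2000.0, np.nan, 2010.0, 2015.0],
    "PoolQC": [np.nan] * 5,
    "SalePrice": [150_000, 180_000, 175_000, 210_000, 230_000],
})

# 1. Keep only the numerical columns.
numeric = df.select_dtypes(include="number")

# 2. Count missing values per column and show the columns that have any.
nan_counts = numeric.isnull().sum()
print(nan_counts[nan_counts > 0])

# 3. Drop columns in which every value is missing.
numeric = numeric.dropna(axis="columns", how="all")

# 4. Fill the remaining gaps (assumption: linear interpolation between
#    neighboring rows, extended to both ends of each column).
clean = numeric.interpolate(method="linear", limit_direction="both")
print(clean)
```

Interpolating between neighboring rows only makes sense if the row order carries meaning; filling with a column mean or median is a common alternative when it does not.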


[Back to top](#Index)

### 2.4 Correlation

Describe the correlation between the variables in your data. How can the correlation help you make an educated guess about how to proceed with your analysis? Will you explore different variables based on the correlation you found? If so, describe what you did and be sure to include what you found with the new set of variables.
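One way to approach this step is to rank the features by their correlation with the sale price. The sketch below uses synthetic data (two informative columns and one noise column); the column names mirror the data set, but the values and the 0.3 threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data: two informative features and one pure-noise column.
quality = rng.integers(1, 11, size=n).astype(float)
area = rng.normal(1500, 400, size=n)
noise = rng.normal(0, 1, size=n)
price = 20_000 * quality + 100 * area + rng.normal(0, 20_000, size=n)

df = pd.DataFrame({
    "OverallQual": quality,
    "GrLivArea": area,
    "Noise": noise,
    "SalePrice": price,
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_price = (
    df.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_price)

# Candidate predictors: features whose absolute correlation exceeds a
# chosen threshold (0.3 is an arbitrary illustrative cut-off).
selected = corr_with_price[corr_with_price.abs() > 0.3].index.tolist()
print(selected)
```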


[Back to top](#Index)

## 3. Project Description

Describe, using 150 words or less, how your analysis improves upon the analysis performed by Dr. Williams. Explain the variables that you analyzed, why you selected them, and what relationships you determined in your analysis.
Make sure you explain specifically what findings you derived from your analysis of the data.


[Back to top](#Index)

### 3.1 Linear Regression

A multiple linear regression model was used. Several features had a strong correlation with the sale price, and their relationship to the sale price was approximately linear. Linear regression is straightforward to implement and can provide good predictive capabilities for the existing data. The scikit-learn library was used, specifically the **LinearRegression()** estimator. The multiple linear regression model can be represented with the equation:

$$Y = a_{0} + a_{1}X_{1} + a_{2}X_{2} + ... + a_{p}X_{p}$$

Where:  
- $Y$ is the target (dependent variable), e.g., the sale price
- $X_{i}$ denotes predictor $i$ (an independent variable), e.g., Overall Quality
- $a_{0}$ represents the intercept or bias
- $a_{i}$ represents the estimated parameter (slope or weight) for independent variable $i$

Linear regression estimates the model parameters by minimizing the sum of the squared errors, that is, the squared differences between the observed and the predicted values of $Y$.
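The least-squares idea can be illustrated with a short sketch. For $n$ observations, the parameters are chosen to minimize $\sum_{j=1}^{n} \left(y_{j} - a_{0} - a_{1}x_{j1} - \dots - a_{p}x_{jp}\right)^{2}$. The example below uses synthetic data generated from known parameters (an assumption for illustration only) and solves the problem twice: directly with NumPy and with scikit-learn's **LinearRegression()**.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic data generated from known parameters: Y = 3 + 2*X1 - 1*X2 + noise.
X = rng.normal(size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Least squares by hand: prepend a column of ones so that the intercept a_0
# is estimated together with the slopes.
X_design = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# The same fit with scikit-learn.
model = LinearRegression().fit(X, y)

print("lstsq  :", params)
print("sklearn:", model.intercept_, model.coef_)
```

Both approaches should recover parameters close to the ones used to generate the data, and they should agree with each other up to numerical precision.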


Give a description (500 words or less) of the algorithm you use in this project. Include mathematical and computational details about linear regression.

Include details about the theory (origin of the method, derivation, and formulas) and the necessary steps to implement the algorithm using Python.



[Back to top](#Index)

### 3.2 Analysis 

Implement the algorithm on your data according to the examples in Video 7.9 and Video 7.10.

Try to improve the results of your model analysis by including a different number of variables in your code for linear regression. Use what you learned about the correlation between variables when you explored your data to help you select these variables.

Compare the results of at least three different groups of variables. In other words, run a linear regression algorithm on at least three different sets of independent variables. How many variables to include in each set is up to you.

For each step, make sure you include your code. Ensure that your code is commented.
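A possible structure for comparing several sets of variables is sketched below. The data is synthetic and the three feature sets are hypothetical choices; in the project, the DataFrame would be the cleaned training data and the sets would come from the correlation analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 100

# Synthetic stand-in for the cleaned training data.
df = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=n).astype(float),
    "GrLivArea": rng.normal(1500, 400, size=n),
    "GarageArea": rng.normal(450, 150, size=n),
})
df["SalePrice"] = (
    20_000 * df["OverallQual"]
    + 50 * df["GrLivArea"]
    + 40 * df["GarageArea"]
    + rng.normal(0, 15_000, size=n)
)

# Three candidate sets of independent variables (hypothetical choices).
feature_sets = {
    "set_1": ["OverallQual"],
    "set_2": ["OverallQual", "GrLivArea"],
    "set_3": ["OverallQual", "GrLivArea", "GarageArea"],
}

scores = {}
for name, cols in feature_sets.items():
    model = LinearRegression().fit(df[cols], df["SalePrice"])
    # score() returns R^2 on the data passed in (here, the training data).
    scores[name] = model.score(df[cols], df["SalePrice"])
    print(f"{name}: {cols} -> R^2 = {scores[name]:.3f}")
```

R² measured on the training data cannot decrease when variables are added to a set, so a fair comparison of the sets also needs a score on data that was not used for fitting.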





[Back to top](#Index)

### 3.3 Results

 What are your results? Which model performed better? Can you explain why? Include a detailed summary and a description of the metrics used to compute the accuracy of your predictions.

For each step, make sure you include your code. Ensure that your code is commented.
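The metrics can be computed as in the sketch below. R² is the metric reported in this notebook; the root mean squared error (RMSE) is added here as an assumption, since it is a common companion metric. The actual and predicted prices are made-up values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted prices.
y_true = np.array([150_000.0, 200_000.0, 250_000.0, 300_000.0])
y_pred = np.array([160_000.0, 190_000.0, 260_000.0, 290_000.0])

# R^2 = 1 - SS_res / SS_tot: share of the variance in y explained by the model.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE: typical size of a prediction error, in the units of the target.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"R^2 (manual)  = {r2_manual:.3f}")
print(f"R^2 (sklearn) = {r2_score(y_true, y_pred):.3f}")
print(f"RMSE          = {rmse:,.0f}")
```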



[Back to top](#Index)

### 3.4 Verify Your Model Against Test Data

Now that you have a prediction model, it's time to test your model against test data to confirm its accuracy on new data. The test data is located in the file `jtest.csv`.

What do you observe? Are these results in accordance with what you found earlier? How can you justify this?
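The verification step can be sketched as follows. To keep the example self-contained, both the training and the test frames are synthetic; in the notebook, the test frame would be loaded from `jtest.csv` and cleaned in the same way as the training data. The helper `make_frame` and the feature choice are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def make_frame(rng, n):
    """Build a synthetic frame with the columns the model expects."""
    df = pd.DataFrame({
        "OverallQual": rng.integers(1, 11, size=n).astype(float),
        "GrLivArea": rng.normal(1500, 400, size=n),
    })
    df["SalePrice"] = (
        20_000 * df["OverallQual"]
        + 50 * df["GrLivArea"]
        + rng.normal(0, 20_000, size=n)
    )
    return df


rng = np.random.default_rng(3)
train = make_frame(rng, 100)  # stand-in for the training data
test = make_frame(rng, 100)   # in the notebook: pd.read_csv("jtest.csv")

features = ["OverallQual", "GrLivArea"]  # hypothetical feature choice

# Fit on the training rows only, then score both sets with the same model.
model = LinearRegression().fit(train[features], train["SalePrice"])
r2_train = model.score(train[features], train["SalePrice"])
r2_test = model.score(test[features], test["SalePrice"])

print(f"R^2 train = {r2_train:.3f}, R^2 test = {r2_test:.3f}")
```

A test score that is much lower than the training score is a sign of overfitting; similar scores suggest that the model generalizes to new data.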

[Back to top](#Index)

## Conclusion

The data provided was cleaned, and missing values were imputed. The features most correlated with the sale price were used to train our model, which produced the best results. Multiple linear regression was chosen for its simplicity and predictive power.
The model predicted prices well: R² = 0.85 on the training data and R² = 0.75 on unseen data.

To improve the model's predictions, a polynomial fit (order 2 or higher) could be tried.
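A polynomial fit of this kind could be set up as in the sketch below, which was not part of the analysis in this notebook. It uses scikit-learn's `PolynomialFeatures` together with **LinearRegression()** on synthetic data with a deliberately curved relationship.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)

# Synthetic data with a curved relationship between one feature and the target.
x = rng.uniform(0, 10, size=(200, 1))
y = 5.0 + 2.0 * x[:, 0] + 1.5 * x[:, 0] ** 2 + rng.normal(scale=2.0, size=200)

# Plain linear fit for comparison.
linear = LinearRegression().fit(x, y)

# Degree-2 polynomial regression: expand the features (x, x^2), then fit a
# linear model on the expanded features.
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(x, y)

r2_linear = linear.score(x, y)
r2_poly = poly.score(x, y)
print(f"linear R^2 = {r2_linear:.3f}, degree-2 R^2 = {r2_poly:.3f}")
```

Higher polynomial orders always fit the training data at least as well, so any gain should be confirmed on the test data before the more complex model is preferred.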





[Back to top](#Index)

## References

- API Reference. “pandas.DataFrame.dropna.” Accessed Nov-28-2022. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html 

- API Reference. “sklearn.linear_model.LinearRegression.” Accessed Nov-28-2022. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html 

- GeeksforGeeks. “Python | Pandas dataframe.skew().” Accessed Nov-28-2022. https://www.geeksforgeeks.org/python-pandas-dataframe-skew/
