**[Machine Learning Course Home Page](https://www.kaggle.com/learn/machine-learning)**

---


This exercise will test your ability to read a data file and understand statistics about the data.

In later exercises, you will apply techniques to filter the data, build a machine learning model, and iteratively improve your model.

The course examples use data from Melbourne. To ensure you can apply these techniques on your own, you will have to apply them to a new dataset (with house prices from Iowa).

The exercises use a "notebook" coding environment.  In case you are unfamiliar with notebooks, we have a [90-second intro video](https://www.youtube.com/watch?v=4C2qMnaIKL4).

# Exercises

Run the following cell to set up code-checking, which will verify your work as you go.

In [None]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex2 import *
print("Setup Complete")

## Step 1: Loading Data
Read the Iowa data file into a Pandas DataFrame called `home_data`.

In [None]:
import pandas as pd

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

# Fill in the line below to read the file into a variable home_data
home_data = pd.read_csv(iowa_file_path)

# Call line below with no argument to check that you've loaded the data correctly
step_1.check()

In [None]:
# Lines below will give you a hint or solution code
#step_1.hint()
#step_1.solution()
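
Beyond the automated check, a quick manual look at the loaded frame can catch a wrong path or a mis-parsed header. The sketch below is only illustrative: it reads a tiny made-up CSV from memory (the name `sample` and all values are hypothetical stand-ins, not rows from `train.csv`) and inspects the shape, column names, and missing values.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for train.csv (all values are made up)
csv_text = """Id,LotArea,YearBuilt,SalePrice
1,8000,2000,200000
2,9500,1975,180000
3,12000,2005,225000
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# (rows, columns) shows how much data was read
print(sample.shape)
# Column names confirm that the header row was parsed as expected
print(list(sample.columns))
# Number of missing values in each column
print(sample.isna().sum())
```

The same three calls should work on `home_data` once it has been loaded.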

## Step 2: Review The Data
Use the command you learned to view summary statistics of the data. Then fill in the variables to answer the following questions.

In [None]:
# Print summary statistics in next line
home_data.describe()

In [None]:
# What is the average lot size (rounded to nearest integer)?
avg_lot_size = round(home_data.LotArea.mean())

# As of today, how old is the newest home (current year minus the year it was built)?
# Note: the current year is hard-coded as 2021 here
newest_home_age = 2021 - max(home_data.YearBuilt)

# Checks your answers
step_2.check()

In [None]:
#step_2.hint()
#step_2.solution()
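
As a rough illustration of how the two answers relate to the summary table, the sketch below uses a small made-up frame (`toy` and its values are hypothetical). It also reads the current year from the system clock instead of hard-coding it; this is an assumption about what "as of today" should mean, and the exercise checker may expect a fixed year.

```python
from datetime import date

import pandas as pd

# Hypothetical toy data; the real exercise uses home_data from train.csv
toy = pd.DataFrame({
    "LotArea": [8000, 9500, 12000, 10500],
    "YearBuilt": [1975, 1998, 2005, 2009],
})

# describe() reports count, mean, std, min, quartiles, and max per numeric column
summary = toy.describe()
print(summary)

# The "mean" row of describe() matches a direct .mean() call
avg_lot_size = round(toy["LotArea"].mean())

# Assumption: take the current year from the system clock rather than a literal
current_year = date.today().year
newest_home_age = current_year - toy["YearBuilt"].max()
print(avg_lot_size, newest_home_age)
```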

## Think About Your Data

The newest house in your data isn't that new.  A few potential explanations for this:
1. They haven't built new houses where this data was collected.
2. The data was collected a long time ago. Houses built after the data publication wouldn't show up.

If the reason is explanation #1 above, does that affect your trust in the model you build with this data? What about if it is reason #2?

How could you dig into the data to see which explanation is more plausible?
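
One possible way to start, sketched on made-up data (`toy` and its values are hypothetical; only the column names `YearBuilt` and `YrSold` come from the Iowa file): compare the latest build year with the latest sale year, and count how many homes were built in the years just before the last sale.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Iowa file
toy = pd.DataFrame({
    "YearBuilt": [1960, 1995, 2004, 2008, 2009],
    "YrSold": [2006, 2007, 2008, 2009, 2010],
})

latest_build = toy["YearBuilt"].max()
latest_sale = toy["YrSold"].max()

# If the last recorded sale is itself years old, the data is simply old
print("latest build year:", latest_build)
print("latest sale year:", latest_sale)

# Homes built in the five years before the last sale, counted per year.
# A steady flow right up to the last sale year would suggest that
# building had not stopped where the data was collected.
recent = toy.loc[toy["YearBuilt"] >= latest_sale - 5, "YearBuilt"]
recent_builds = recent.value_counts().sort_index()
print(recent_builds)
```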

Check out this **[discussion thread](https://www.kaggle.com/learn-forum/60581)** to see what others think or to add your ideas.

# Keep Going

You are ready for **[Your First Machine Learning Model](https://www.kaggle.com/dansbecker/your-first-machine-learning-model).**


# Some More Exploring

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
home_data.head()

## The Buyer Trend

Let's see whether buyers are comfortable with old houses or whether new houses are selling like hot cakes.

In [None]:
home_data["sold_minus_built"] = home_data["YrSold"] - home_data["YearBuilt"]
sns.displot(home_data, x="sold_minus_built")

It seems new is the winner! The trend is towards buying newer houses, those less than 15 years old.
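
A histogram shows the shape of the distribution, but a number can make the claim easier to check. The sketch below, on made-up data (`toy` and its values are hypothetical), computes the median age at sale and the share of sales where the house was under 15 years old.

```python
import pandas as pd

# Hypothetical toy data; column names follow the Iowa file
toy = pd.DataFrame({
    "YearBuilt": [1950, 1999, 2003, 2006, 2007],
    "YrSold": [2006, 2007, 2008, 2008, 2009],
})

# Age of each house at the time it was sold
age_at_sale = toy["YrSold"] - toy["YearBuilt"]

# The mean of a boolean Series is the fraction of True values
share_under_15 = (age_at_sale < 15).mean()

print(f"median age at sale: {age_at_sale.median()}")
print(f"share sold under 15 years old: {share_under_15:.0%}")
```

Running the same two lines on `home_data["sold_minus_built"]` would put a figure on the trend seen in the plot.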

## Trend Across Years 

Interestingly, the sales in this data start only in 2006 and end in 2010, so we have just about four years of data.
The number of houses sold in 2010 is also much lower than in the other years.

In [None]:
sns.set_theme(style="darkgrid")
sns.displot(
    home_data, x="sold_minus_built", col="YrSold",
    height=3, facet_kws=dict(margin_titles=True),
)
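
The faceted plot shows the pattern visually; a plain count per year gives the exact numbers. A minimal sketch on made-up data (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; only the column name YrSold comes from the Iowa file
toy = pd.DataFrame({
    "YrSold": [2006, 2006, 2007, 2007, 2008, 2009, 2009, 2010],
})

# Number of sales per year, ordered by year rather than by count
sales_per_year = toy["YrSold"].value_counts().sort_index()
print(sales_per_year)

# First and last year with any recorded sale
print("years covered:", toy["YrSold"].min(), "to", toy["YrSold"].max())
```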

## The Buying Period  

Buying is most active during the second and third quarters. There is also very little data for the 2010 buying period.

In [None]:
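# Map the month sold (1-12) to a quarter (1-4)
home_data["Quarter"] = (home_data["MoSold"] - 1) // 3 + 1
# Count the sales in each (year, quarter) pair
df = home_data.groupby(["YrSold", "Quarter"]).size()\
              .reset_index()\
              .rename(columns={0: 'NumSold'})

# factorplot was renamed to catplot in seaborn 0.9
sns.catplot(data=df, x='YrSold', y='NumSold', hue='Quarter', kind='bar')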
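
To see the same counts as a table, and to check whether the last year is simply incomplete, something like the following could be used. This is a sketch on made-up data (`toy` and its values are hypothetical); it applies the same month-to-quarter formula as the cell above.

```python
import pandas as pd

# Hypothetical toy data; column names follow the Iowa file
toy = pd.DataFrame({
    "YrSold": [2009, 2009, 2009, 2009, 2010, 2010],
    "MoSold": [2, 5, 8, 11, 3, 6],
})

# Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on
toy["Quarter"] = (toy["MoSold"] - 1) // 3 + 1

# Rows are years, columns are quarters, cells are sale counts (0 where none)
counts = pd.crosstab(toy["YrSold"], toy["Quarter"])
print(counts)

# Last month with any sale in each year; a value below 12 hints that
# the year was only partly recorded
last_month = toy.groupby("YrSold")["MoSold"].max()
print(last_month)
```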

## Is the Data Helpful?

- It is difficult to judge current prices with this data, especially given the lack of information for 2010.
- However, judging by the interquartile range of the sale price, there does not seem to be a huge difference across years. This data can therefore give a good starting point for predicting prices, but newer data is always more relevant.

In [None]:
sns.set_theme(style="whitegrid")
ax = sns.boxplot(x="YrSold", y="SalePrice", data=home_data)
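
The box in each boxplot spans the first to third quartile. To compare the years numerically, the interquartile range can be computed per year, as in this sketch on made-up data (`toy` and its values are hypothetical; pandas' default linear interpolation is used for the quantiles):

```python
import pandas as pd

# Hypothetical toy data; column names follow the Iowa file
toy = pd.DataFrame({
    "YrSold": [2006] * 4 + [2007] * 4,
    "SalePrice": [100000, 150000, 200000, 250000,
                  110000, 160000, 210000, 260000],
})

# First and third quartile of SalePrice for each year, one column per quantile
quartiles = (
    toy.groupby("YrSold")["SalePrice"]
       .quantile([0.25, 0.75])
       .unstack()
)

# Interquartile range: the width of the box in the boxplot
quartiles["IQR"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```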


