**[Machine Learning Course Home Page](https://www.kaggle.com/learn/machine-learning)**

---


This exercise will test your ability to read a data file and understand statistics about the data.

In later exercises, you will apply techniques to filter the data, build a machine learning model, and iteratively improve your model.

The course examples use data from Melbourne. To ensure you can apply these techniques on your own, you will have to apply them to a new dataset (with house prices from Iowa).

The exercises use a "notebook" coding environment. If you are unfamiliar with notebooks, we have a [90-second intro video](https://www.youtube.com/watch?v=4C2qMnaIKL4).

# Exercises

Run the following cell to set up code-checking, which will verify your work as you go.

In [None]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex2 import *
print("Setup Complete")

## Step 1: Loading Data
Read the Iowa data file into a Pandas DataFrame called `home_data`.

In [None]:
import pandas as pd

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

# Fill in the line below to read the file into a variable home_data
home_data = pd.read_csv(iowa_file_path)

# Call line below with no argument to check that you've loaded the data correctly
step_1.check()

In [None]:
# Lines below will give you a hint or solution code
#step_1.hint()
#step_1.solution()

In [None]:
home_data.head()

In [None]:
home_data.columns

## Step 2: Review The Data
Use the command you learned to view summary statistics of the data. Then fill in variables to answer the following questions.

In [None]:
# Print summary statistics in next line
home_data.describe()

In [None]:
unique = home_data[['Id', 'YearBuilt', 'YrSold']]
unique.describe()

In [None]:
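last_sold_yr = home_data['YrSold'].max()
# Latest month within the final sale year; the overall max of MoSold
# would only give the largest month number seen in any year
last_sold_mo = home_data.loc[home_data['YrSold'] == last_sold_yr, 'MoSold'].max()
latest_built = home_data['YearBuilt'].max()
latest_remodel = home_data['YearRemodAdd'].max()
print(f"Last sold on {last_sold_mo}/{last_sold_yr}")
print(f"Latest build year is {latest_built}")
print(f"Latest remodel year is {latest_remodel}")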

In [None]:
home_data['YearBuilt'].describe()

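The quartiles of `YearBuilt` split the houses into four equal groups that cover very different time spans:

- 25% of houses were built between 1872 and 1954 (82 years)
- 25% of houses were built between 1954 and 1973 (19 years)
- 25% of houses were built between 1973 and 2000 (27 years)
- 25% of houses were built between 2000 and 2010 (10 years)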
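
The spans in the list above can also be computed rather than read off `describe()` by hand. The following is a minimal sketch: `quartile_spans` is a hypothetical helper (not part of the course code), and `toy_years` is made-up data standing in for `home_data['YearBuilt']`.

```python
import pandas as pd

def quartile_spans(years: pd.Series) -> pd.DataFrame:
    """Return the year range covered by each quarter of the values."""
    # Quartile boundaries: min, 25%, 50%, 75%, max
    edges = years.quantile([0, 0.25, 0.5, 0.75, 1.0]).to_list()
    rows = [
        {"start": lo, "end": hi, "span_years": hi - lo}
        for lo, hi in zip(edges[:-1], edges[1:])
    ]
    return pd.DataFrame(rows, index=["Q1", "Q2", "Q3", "Q4"])

# Toy stand-in for home_data['YearBuilt'] (made-up values, not the Iowa data)
toy_years = pd.Series([1900, 1950, 1960, 1970, 1980, 1990, 2000, 2005, 2010])
spans = quartile_spans(toy_years)
print(spans)
```

In the notebook, calling `quartile_spans(home_data['YearBuilt'])` should give the same boundaries that `describe()` reports. A short span for the last quarter means many houses were built in a small number of recent years.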

In [None]:
home_data.YrSold.describe()

In [None]:
home_data['YearBuilt'].value_counts().sort_index(ascending=False).head()

In [None]:
# value_counts() already sorts by count, most frequent build year first
home_data['YearBuilt'].value_counts()

In [None]:
home_data.YrSold.value_counts()

In [None]:
# Same counts as above, ordered by year instead of by frequency
home_data['YrSold'].value_counts().sort_index()

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
plt.figure(figsize=(10,9))
plt.hist(home_data['YearBuilt'],bins=56)
plt.show()

- The histogram of `YearBuilt` shows that the dataset contains no houses built after 2010.

In [None]:
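plt.figure(figsize=(17,12))
# Pass the column by keyword; newer seaborn versions no longer accept it positionally
sns.countplot(x='YearBuilt', data=home_data)
plt.xticks(rotation=90)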

In [None]:
sns.countplot(x='YrSold', data=home_data)

In [None]:
# first graph 
minYear = min( min(home_data['YearBuilt']), min(home_data['YrSold']))
maxYear = max( max(home_data['YearBuilt']), max(home_data['YrSold']))
years = range(minYear, maxYear+1)

df = pd.DataFrame({'year': years}, index=years)
df.index.name = 'year'

df['YearBuilt'] = home_data.groupby('YearBuilt').agg({'YearBuilt': 'count'})
df['YrSold'] = home_data.groupby('YrSold').agg({'YrSold': 'count'})
df = df.drop('year', axis=1)
df = df.fillna(0)

ax1 = df.plot(kind='bar', y=['YearBuilt', 'YrSold'], figsize=(20,4), width=0.9)
def format_x(tick, pos=None):
    # Bar positions are 0-based offsets into `years`; label only decade years
    year = years[int(tick)]
    return str(year) if year % 10 == 0 else ''
plt.xticks(rotation=45)
ax1.xaxis.set_major_formatter(plt.FuncFormatter(format_x))
ax1.set_title('Number of houses built and sold across the years')
ax2 = ax1.twinx()
ax2.set_ylim(ax1.get_ylim())
for ax in [ax1, ax2]:
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
plt.show()

In [None]:
# second graph

import seaborn as sns; sns.set()

df = home_data.groupby(['YearBuilt', 'YrSold']).agg({'YearBuilt': 'count'})
df = df.unstack(level=1).fillna(0).T
df = df.droplevel(0)

fig, ax = plt.subplots(figsize=(20,4))  
sns.heatmap(df, cmap='Blues', ax=ax)
plt.ylabel('YearSold')
plt.yticks(rotation=0)
plt.show()

In [None]:
# third graph
df = home_data[['YrSold', 'SalePrice']]
fig, ax = plt.subplots(figsize=(20,4))  
sns.boxplot(y="YrSold", x="SalePrice", data=df, orient='h', ax=ax)
plt.show()

In [None]:
# Average sale price and number of sales for each (year, month) pair
price_table = (
    home_data.groupby(["YrSold", "MoSold"])["SalePrice"]
    .agg(Avg_SalePrice="mean", SaleCount="count")
    .round({"Avg_SalePrice": 0})
)
price_table

In [None]:
# Use the (YrSold, MoSold) index on the x-axis so the bars match the title
price_table.plot.bar(y="SaleCount", legend=None, figsize=(20, 10), color='navy')
plt.xticks(rotation=45)
plt.title("SaleCount by Year and Month", fontsize = 20)
plt.show()

In [None]:
# What is the average lot size (rounded to nearest integer)?
avg_lot_size = round(home_data['LotArea'].mean())

# As of today, how old is the newest home (current year - the year in which it was built)?
# Note: the current year is hard-coded as 2020 here
newest_home_age = 2020 - home_data['YearBuilt'].max()

# Checks your answers
step_2.check()

In [None]:
#step_2.hint()
#step_2.solution()

## Think About Your Data

The newest house in your data isn't that new.  A few potential explanations for this:
1. They haven't built new houses where this data was collected.
1. The data was collected a long time ago. Houses built after the data publication wouldn't show up.

If the reason is explanation #1 above, does that affect your trust in the model you build with this data? What about if it is reason #2?

How could you dig into the data to see which explanation is more plausible?
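
One possible way to approach this question is to compare the newest build year with the range of sale years. If sales stop in about the same year as construction, explanation #2 (the data is simply old) looks more plausible. If sales continue long after the newest build year, explanation #1 fits better. The sketch below assumes only the `YearBuilt` and `YrSold` columns; `build_vs_sale_summary` is a hypothetical helper and `toy` is made-up data, not the Iowa file.

```python
import pandas as pd

def build_vs_sale_summary(df: pd.DataFrame) -> dict:
    """Compare the newest build year with the range of sale years."""
    newest_built = int(df["YearBuilt"].max())
    first_sale = int(df["YrSold"].min())
    last_sale = int(df["YrSold"].max())
    return {
        "newest_built": newest_built,
        "first_sale": first_sale,
        "last_sale": last_sale,
        # A small gap suggests building continued until data collection ended
        "gap_years": last_sale - newest_built,
        # Number of houses sold within a year of being built
        "sold_new": int((df["YrSold"] - df["YearBuilt"] <= 1).sum()),
    }

# Toy stand-in for home_data (made-up values, not the Iowa data)
toy = pd.DataFrame({
    "YearBuilt": [1950, 1999, 2006, 2009, 2010],
    "YrSold":    [2006, 2007, 2008, 2009, 2010],
})
summary = build_vs_sale_summary(toy)
print(summary)
```

In the notebook, `build_vs_sale_summary(home_data)` would give the same comparison for the real data. This is only one heuristic; the histogram and heatmap drawn earlier show the same relationship visually.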

Check out this **[discussion thread](https://www.kaggle.com/learn-forum/60581)** to see what others think or to add your ideas.

# Keep Going

You are ready for **[Your First Machine Learning Model](https://www.kaggle.com/dansbecker/your-first-machine-learning-model).**



