## Introduction

Welcome to my data visualization report.
This is an investigative presentation of data analysis and data visualization using the Python programming language. I use a dataset from Kaggle.com for this analysis. The process is divided into seven parts:

* **1. Dataset Background**
* **2. Importing Dataset**
* **3. Analysing Dataset**
* **4. Amending NULL Values**
* **5. Data Visualization**
* **6. Finding Outliers**
* **7. Geographical Map**


## Purpose

Analyse and present data in various graphs using Python and relevant libraries.

## Acknowledgements

This data was initially featured in the following paper: Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297.

I encountered it in 'Hands-On Machine Learning with Scikit-Learn and TensorFlow' by Aurélien Géron. Aurélien Géron wrote that this dataset is a modified version of the California Housing dataset available from Luís Torgo's page (University of Porto).

## Dataset Background

**California Housing Prices**
The data contains information from the 1990 California census. It describes the houses in a given California district, with summary statistics based on that census. It provides an accessible introductory dataset for the basics of machine learning.


About this file
1. longitude: longitude location
2. latitude: latitude location
3. housingMedianAge: Median age of a house **within a block**
4. totalRooms: Total number of rooms **within a block**
5. totalBedrooms: Total number of bedrooms **within a block**
6. population: Total number of people residing **within a block**
7. households: Total number of households, a group of people residing within a home unit, **for a block**
8. medianIncome: Median income for households within a block of houses **(measured in tens of thousands of US Dollars)**
9. medianHouseValue: Median house value for households within a block **(measured in US Dollars)**
10. oceanProximity: Location of the house relative to the ocean/sea

*Note: California's population in 1990 was 29.95 million*
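
Because most columns are totals or medians for a whole block, it can help to convert them into more readable quantities. The sketch below is only an illustration: the two rows are made-up sample values, and the snake_case column names (`total_rooms`, `households`, `median_income`) are an assumption based on the names used in the code cells of this report.

```python
import pandas as pd

# Illustrative block-level rows (hypothetical values, not read from the CSV)
blocks = pd.DataFrame({
    "total_rooms": [880, 7099],
    "households": [126, 1138],
    "median_income": [8.3252, 8.3014],  # tens of thousands of US Dollars
})

# Convert median income from tens of thousands of dollars to dollars
blocks["median_income_usd"] = blocks["median_income"] * 10_000

# Average number of rooms per household within each block
blocks["rooms_per_household"] = blocks["total_rooms"] / blocks["households"]

print(blocks)
```

The derived columns here are named for illustration only; they are not part of the original dataset.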

## Importing Dataset

In [None]:
# Importing libraries

import numpy as np # data processing
import matplotlib.pyplot as plt # Visualization
import seaborn as sns # Visualization
import folium # Visualization Map
from folium.plugins import HeatMap # Visualization Map
from mpl_toolkits.basemap import Basemap # Visualization Map


In [None]:
# Importing Data
import pandas as pd
house = pd.read_csv("../input/california-housing-prices/housing.csv")

## Analysing Dataset

In [None]:
# Basic information about dataset
print(house.describe())

In [None]:
# Find rows and columns from this dataset
# Display column info
print("The number of rows and columns: " + str(house.shape))
print('\nThe columns are: \n')
for column in house.columns:
    print(column, end='.\t\n')

In [None]:
# Display the first five rows in the dataset
print(house.head())

In [None]:
# Display the last five rows in the dataset
print(house.tail())

In [None]:
# Show a summary of this dataset: columns, data types, memory usage and so on.
house.info()  # info() prints directly and returns None, so no print() wrapper is needed

## Amending NULL Values

In [None]:
# Check any NULL value in the dataset
print(house.isnull().sum())

In [None]:
# Display the number of NULL values per column as a bar chart
plt.figure(figsize=(15,8))
plt.title('Missing data')
plt.ylabel("Count")
house.isnull().sum().plot(kind= 'bar' )

In [None]:
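# Fix NULL values: fill missing total_bedrooms with the column median
median = house["total_bedrooms"].median()
# Assign the result back instead of using inplace=True on a column selection
house["total_bedrooms"] = house["total_bedrooms"].fillna(median)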
print(house.isnull().sum())
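
To make the imputation step easier to follow, here is a minimal sketch on a toy column. The values are invented for illustration; the point is that `median()` skips missing entries and is less affected by a single large value than the mean would be.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries and one extreme value (hypothetical data)
toy = pd.DataFrame({"total_bedrooms": [2.0, np.nan, 4.0, 100.0, np.nan]})

# median() ignores NaN by default: median of [2, 4, 100] is 4
median = toy["total_bedrooms"].median()
mean = toy["total_bedrooms"].mean()  # pulled upward by the value 100

# Fill the gaps with the median and assign the result back to the column
toy["total_bedrooms"] = toy["total_bedrooms"].fillna(median)

print("median:", median, "mean:", round(mean, 2))
print(toy["total_bedrooms"].tolist())
```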

In [None]:
# Count non-null values in each column
print(house.count())

## Data Visualization

In [None]:
# Histogram of each numeric column
house.hist(bins=80, figsize=(15, 15))

In [None]:
# Pie chart to show ocean proximity value
plt.figure(figsize=(8, 8))
plt.title("Pie chart on ocean proximity value")
house['ocean_proximity'].value_counts().plot(kind = 'pie',colormap = 'jet')


# Number of houses in each ocean_proximity category
# X axis: ocean proximity
# Y axis: house amount
plt.figure(figsize=(12, 8))
sns.countplot(data=house, x="ocean_proximity")
plt.xlabel("Ocean Proximity")
plt.ylabel("House Amount")
plt.title("Number of houses on ocean proximity categories")

In [None]:
# Density of Median House Value on Ocean Proximity
plt.figure(figsize=(12, 8))
sns.stripplot(data=house, x="ocean_proximity", y="median_house_value",
              jitter=0.3)
plt.xlabel("Ocean Proximity")
plt.ylabel("Median House Value")
plt.title("Density of Median House Value on Ocean Proximity")

In [None]:
# house value on ocean_proximity categories
# X axis: Ocean Proximity
# Y axis: Median House Value
plt.figure(figsize=(12, 8))
sns.boxplot(data=house, x="ocean_proximity", y="median_house_value",
            palette="viridis")
plt.xlabel("Ocean Proximity")
plt.ylabel("Median House Value")
plt.title("House value on Ocean Proximity Categories")

In [None]:
#heatmap using seaborn
#set the context for plotting 
sns.set(context="paper",font="monospace")
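# numeric_only=True skips the text column ocean_proximity, which cannot be correlated
housing_corr_matrix = house.corr(numeric_only=True)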
#set the matplotlib figure
fig, axe = plt.subplots(figsize=(12,8))
#Generate color palettes 
cmap = sns.diverging_palette(220,10,center = "light", as_cmap=True)
#draw the heatmap
plt.title("Correlation between features")
sns.heatmap(housing_corr_matrix,vmax=1,square =True, cmap=cmap,annot=True );

print('\nAs shown in the Heatmap there is a strong correlation between the following features:\n')
print('- households')
print('- total_bedrooms')
print('- total_rooms')
print('- population')

print('\n')

print('The number of bedrooms in a district is obviously correlated\nwith the number of rooms in the district; the same is true for the number of families\nand the total population living in a district. Finally, the number of rooms is correlated\nwith the population.\n')
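
Reading strong correlations off a heatmap by eye can be error-prone, so a small helper can list them instead. This is only a sketch: the function name `strong_pairs`, the 0.8 threshold, and the toy values are my own assumptions and not part of the original analysis.

```python
from itertools import combinations

import pandas as pd


def strong_pairs(df, threshold=0.8):
    """Return (column_a, column_b, r) for pairs whose |r| is at least threshold."""
    corr = df.corr(numeric_only=True)
    result = []
    # combinations() visits each unordered pair of columns exactly once
    for a, b in combinations(corr.columns, 2):
        r = corr.loc[a, b]
        if abs(r) >= threshold:
            result.append((a, b, r))
    return result


# Toy data (hypothetical): bedrooms track rooms closely, age does not
toy = pd.DataFrame({
    "total_rooms": [100, 200, 300, 400, 500],
    "total_bedrooms": [20, 41, 59, 80, 101],
    "housing_median_age": [30, 10, 40, 20, 25],
})

pairs = strong_pairs(toy)
for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r:.3f}")
```

On the real dataset the same helper could be called as `strong_pairs(house)`, assuming `house` is the DataFrame loaded earlier.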


## Finding Outliers

In [None]:
#Finding Outliers
plt.figure(figsize=(15,5))
sns.boxplot(x=house['housing_median_age'])
plt.figure(figsize=(15,5))
sns.boxplot(x=house['median_house_value'])
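
The points a box plot draws beyond its whiskers follow, by default, the 1.5 × IQR rule. The sketch below applies that rule numerically to a toy series. It is an illustration of the rule only, not the method used later in this report to remove outliers, and the helper name `iqr_bounds` and the values are hypothetical.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the first and third quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy house values in US Dollars (hypothetical data)
values = pd.Series([100_000, 150_000, 180_000, 200_000, 220_000, 250_000, 900_000])

lower, upper = iqr_bounds(values)
# Anything outside the fences is flagged as an outlier
outliers = values[(values < lower) | (values > upper)]

print("fences:", lower, upper)
print("outliers:", outliers.tolist())
```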

In [None]:
# Histogram to show outliers in the "median_house_value" column
plt.figure(figsize=(12,8))
plt.xlabel("Median House Value")
plt.ylabel("House Amount")
house['median_house_value'].hist(bins=100)

In [None]:
# Remove outliers (rows with a median_house_value of 500001 or more)
house = house.loc[house['median_house_value']<500001,:]
plt.figure(figsize=(12,8))
plt.xlabel("Median House Value")
plt.ylabel("House Amount")
house['median_house_value'].hist(bins=100)

## Geographical Map

In [None]:
# Locate California on the map
m = Basemap(projection='mill', llcrnrlat=25, urcrnrlat=49.5,
            llcrnrlon=-140, urcrnrlon=-50, resolution='l')

plt.figure(figsize=(25,17))
m.drawcountries() 
m.drawstates()  
m.drawcoastlines()
x,y = m(-119.4179,36.7783)
m.plot(x, y, 'ro', markersize=20, alpha=.8) 
m.bluemarble() 
m.drawmapboundary(color = '#FFFFFF')

In [None]:
# Joint plot of latitude and longitude with marginal histograms to show population density
# jointplot creates its own figure; `height` replaces the old `size` argument
g = sns.jointplot(x=house.latitude.values, y=house.longitude.values, height=10)
g.set_axis_labels("latitude", "longitude")

In [None]:
# Geographical Chart shows median house value
house.plot(kind="scatter", x='longitude', y='latitude', figsize=(15, 10),
           c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True, )
plt.title("Geographical chart Shows Median House Value")

In [None]:
# Interactive heatmap of block locations across California
# Named ca_map to avoid shadowing the built-in map()
ca_map = folium.Map(location=[36.7783,-119.4179],
                    zoom_start = 6, min_zoom=5) 

df = house[['latitude', 'longitude']]
data = df.values.tolist()  # list of [latitude, longitude] pairs
HeatMap(data, radius=10).add_to(ca_map)
ca_map
