# Lab 2: Normalization
In this lab, we will practice normalization on the California Housing dataset, which we will load using scikit-learn.

## 1. Create Dataset

In [27]:
import pandas as pd
from sklearn.datasets import fetch_california_housing # pull out california housing dataset

# create object to download dataset
cali_data = fetch_california_housing()

# view dataset info
print(cali_data.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

We can prepare this dataset by joining its data and target attributes into a dataframe, using the feature_names attribute as column names:

In [58]:
# combine features and column names
cali_df = pd.DataFrame(cali_data.data, 
                      columns=cali_data.feature_names)

# add housing prices as a new column
prices = cali_data.target
cali_df["Price"] = prices

# convert housing prices from units of $100,000 to dollars
# (the target is not log-scaled, so no exponentiation is needed)
cali_df["Price"] = (cali_df["Price"] * 100_000).round(2)

cali_df.head()

|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | Price |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|-------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 452600.0 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 358500.0 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 352100.0 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 341300.0 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 342200.0 |


## 2. Apply Standardization to the Dataframe 
Standardize the dataframe using sklearn's StandardScaler, which rescales each column to zero mean and unit variance.

In [None]:
# your code goes here, save results as new dataframe called cali_standardized
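
As a starting point, here is a minimal sketch of standardization on a small toy dataframe rather than on cali_df itself. The names toy_df and toy_standardized and the values in them are made up for illustration; the same pattern should carry over to cali_df.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for cali_df (hypothetical values, not taken from the dataset)
toy_df = pd.DataFrame({
    "MedInc": [2.0, 4.0, 6.0, 8.0],
    "HouseAge": [10.0, 20.0, 30.0, 40.0],
})

# fit_transform learns each column's mean and standard deviation,
# then returns (x - mean) / std as a NumPy array
scaler = StandardScaler()
scaled_values = scaler.fit_transform(toy_df)

# wrap the array back into a dataframe, keeping column names and index
toy_standardized = pd.DataFrame(
    scaled_values, columns=toy_df.columns, index=toy_df.index
)

# each column should now have mean 0 and (population) standard deviation 1
print(toy_standardized.mean().round(6))
print(toy_standardized.std(ddof=0).round(6))
```

Note that StandardScaler divides by the population standard deviation (ddof=0), while pandas' std() defaults to the sample standard deviation (ddof=1), so the two can differ slightly.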

## 3. Apply min-max scaling to the Dataframe
Apply sklearn's MinMaxScaler to cali_df, which rescales each column to the range 0 to 1 by default.

In [None]:
# your code goes here, save results as new dataframe called cali_minmax
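
A similar sketch for min-max scaling, again on a toy dataframe with made-up values (toy_df and toy_minmax are illustrative names, not part of the lab). Each value is mapped to (x - min) / (max - min) for its column.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for cali_df (hypothetical values, not taken from the dataset)
toy_df = pd.DataFrame({
    "MedInc": [2.0, 4.0, 6.0, 8.0],
    "Population": [100.0, 300.0, 500.0, 1100.0],
})

# the default feature_range is (0, 1)
scaler = MinMaxScaler()

# fit_transform returns a NumPy array, so rebuild the dataframe around it
toy_minmax = pd.DataFrame(
    scaler.fit_transform(toy_df), columns=toy_df.columns, index=toy_df.index
)

# every column now has minimum 0 and maximum 1
print(toy_minmax.min())
print(toy_minmax.max())
```

Because the mapping depends only on the column's minimum and maximum, a single extreme value (such as 1100.0 in the toy Population column) pushes the remaining values toward 0.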

## 4. Plot the results
Calculate summary statistics and create bar plots to compare how standardization and min-max scaling each change the MedInc (median income) column:

In [None]:
%matplotlib inline

In [None]:
# bar plot of the original cali_df.MedInc

In [None]:
# cali_df.MedInc summary stats

In [None]:
# bar plot of cali_standardized.MedInc

In [None]:
# cali_standardized.MedInc summary stats

In [None]:
# bar plot of cali_minmax.MedInc

In [None]:
# cali_minmax.MedInc summary stats
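
The comparison above could be sketched as follows, assuming a histogram is an acceptable reading of "bar plot" for a continuous column. The medinc values below are invented stand-ins for cali_df.MedInc, and the versions and summary names are illustrative only.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy stand-in for cali_df[["MedInc"]] (hypothetical values)
medinc = pd.DataFrame({"MedInc": [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 8.0, 15.0]})

# the same column in its original, standardized, and min-max scaled forms
versions = {
    "original": medinc["MedInc"],
    "standardized": pd.Series(StandardScaler().fit_transform(medinc)[:, 0]),
    "min-max": pd.Series(MinMaxScaler().fit_transform(medinc)[:, 0]),
}

# one row of summary statistics per version
summary = pd.DataFrame({name: s.describe() for name, s in versions.items()}).T
print(summary[["mean", "std", "min", "max"]])

# one histogram per version, side by side
fig, axes = plt.subplots(1, len(versions), figsize=(12, 3))
for ax, (name, s) in zip(axes, versions.items()):
    ax.hist(s, bins=5)
    ax.set_title(f"MedInc ({name})")
    ax.set_xlabel("value")
    ax.set_ylabel("count")
fig.tight_layout()
```

Both scalers are linear transformations, so the shape of the histogram should stay the same and only the axis scale changes. The std reported by describe() uses the sample formula, so the standardized column will show a value slightly above 1 on a small sample like this one.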

## 5. Interpret the results
Which method do you think works best for this dataset? Why? Are there any columns that should not be standardized? 
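
If you conclude that some columns should be left unscaled, one way to do that is to transform only a chosen subset. The sketch below uses a toy dataframe with made-up values, and the choice of columns_to_scale is purely hypothetical; deciding which columns belong in it is part of the exercise.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for cali_df (hypothetical values, not taken from the dataset)
toy_df = pd.DataFrame({
    "MedInc": [2.0, 4.0, 6.0, 8.0],
    "AveRooms": [4.0, 5.0, 6.0, 7.0],
    "Price": [150000.0, 250000.0, 300000.0, 450000.0],
})

# hypothetical choice: scale these columns and leave the rest untouched
columns_to_scale = ["MedInc", "AveRooms"]

# work on a copy so the original dataframe is preserved for comparison
toy_partial = toy_df.copy()
toy_partial[columns_to_scale] = StandardScaler().fit_transform(
    toy_df[columns_to_scale]
)

print(toy_partial)
```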