# EDA California housing prices

This dataset appeared in a 1997 paper titled Sparse Spatial Autoregressions by Pace, R. Kelley and Ronald Barry, published in the Statistics and Probability Letters journal. They built it using the 1990 California census data. It contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). [*Source*](https://github.com/ageron/handson-ml/tree/master/datasets/housing)

California is one of the 50 states that make up the USA. It is located on the west coast. Sacramento is the state capital, but Los Angeles is the most populous city.

Another (very) important city in California is San Francisco. If you're here, you probably know that this city is known for being close to Silicon Valley, the place where many of the most valuable startups and companies were born. This is a "problem", because those who live in SF have incredible incomes (almost 110k USD/year).

The objective of this notebook is to give an idea of the basic statistics, together with a univariate and bivariate analysis of the features provided. It also dives into the data in order to find null, missing, and extreme values.

## **What are you going to find in this notebook?**

**Part 1: Data QA**
* General information about the data set (shape, column names, and data types).
* Information about null values and missing data. 
* Information about outliers.
* Inconsistencies.
* Conclusions and strategies about Data QA.

**Part 2: Reporting**
* Finding any relations or trends considering multiple features.
* Analyze the most valuable house.
* Plot an interactive map

If you like the notebook and think that it helped you, **PLEASE UPVOTE**. It helps me stay motivated :)

#### Load packages

In [None]:
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import folium


#### Load Data

In [None]:
df = pd.read_csv('../input/california-housing-prices/housing.csv')

### Part 1: Data QA
First 5 rows of the data set

In [None]:
df.head(5)

Shape of data set

In [None]:
print('This data set has {} tuples and {} columns'.format(df.shape[0],df.shape[1]))

##### Column names

In [None]:
pd.DataFrame(df.columns, columns=['Column names'])

##### Info about data types and non-null counts
The output below shows the data type and the number of non-null values for each attribute.

In [None]:
df.info()

##### Null values
The next two tables show how complete the data set is: first the number of null values per column, then the percentage of non-null values.

In [None]:
df.isnull().sum()

In [None]:
(1-df.isnull().sum()/df.isnull().count())*100

As you can see, only 1 feature has null values:
total_bedrooms is about 99% complete (it has 207 null values).
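
The two tables above can also be combined into a single summary. The block below is a minimal sketch, not part of the original notebook: `null_summary` is a hypothetical helper name, and the toy frame uses made-up values so the example runs without the CSV file.

```python
import numpy as np
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the null count and completeness (%) per column, worst first."""
    nulls = df.isnull().sum()
    complete_pct = (1 - nulls / len(df)) * 100
    summary = pd.DataFrame(
        {"nulls": nulls, "complete_pct": complete_pct.round(2)}
    )
    return summary.sort_values("nulls", ascending=False)


# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame(
    {
        "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
        "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    }
)
print(null_summary(toy))
```

In the notebook, `null_summary(df)` should list total_bedrooms first, since it is the only column with nulls.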

##### Plot outliers

In [None]:
def plot_outliers(df, col):
    """
    Draw a boxplot of a continuous variable of the data set.

    Args:
        - df: pd.DataFrame.
        - col: name of the column to plot.

    Returns: None. The boxplot is displayed with plt.show().

    """
    plt.title(col)
    ax = sns.boxplot(data=df, x=col)
    ax.set(xlabel='')  # the title already names the column
    plt.show()


def plot_hist(df, col):
    """
    Draw a histogram of a continuous variable of the data set.

    Args:
        - df: pd.DataFrame.
        - col: name of the column to plot.

    Returns: None. The histogram is displayed with plt.show().

    """
    plt.title(col)  # label the histogram like the boxplot
    plt.hist(x=df[col], bins=40, color='#D11239')
    plt.show()


In [None]:
for col in df.columns:
    if df[col].dtype == 'float64':
        plot_outliers(df,col)
        plot_hist(df,col)

#### Duplicated values

In [None]:
df.duplicated().sum()

#### Inconsistencies to be checked

* housing_median_age >= 0
* total_bedrooms >= 0
* population >= 0
* households >= 0
* median_income >= 0
* median_house_value >= 0

In [None]:
features_inconsistences = ['housing_median_age','total_bedrooms', 'population',
                           'households','median_income','median_house_value']
for feature in features_inconsistences:
    # Strictly negative values are the inconsistency; 0 is allowed
    if df[feature].min() < 0:
        print('{} has values below 0.\n'.format(feature))
    else:
        print('{} has no values below 0.\n'.format(feature))

#### Observations about Data QA

* The data set is almost complete.
* There are a few outliers in several features, but the strangest feature is median_house_value: most values are near 206,855, yet there are a lot of values close to 500k.
* There are 0 duplicated tuples.
* There are 0 features with inconsistencies.
* About the outliers, the aim of Data QA is only to show them; if the model requires it, feature engineering will be applied to the columns with outliers.
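
To put a number on the outliers seen in the boxplots, one option is the 1.5 × IQR rule that boxplots conventionally use for their whiskers. This is a minimal sketch under that assumption. `iqr_outlier_counts` is a hypothetical helper, not part of the original notebook, and the toy values are made up.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column limits with the frame's columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame(
    {
        "population": [300.0, 320.0, 340.0, 360.0, 380.0, 5000.0],
        "households": [100.0, 110.0, 120.0, 130.0, 140.0, 150.0],
    }
)
print(iqr_outlier_counts(toy))
```

In the toy frame only the value 5000 in population falls outside the limits. Calling `iqr_outlier_counts(df)` on the housing data would give one count per numeric column.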

### Part 2: Reporting

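The next block contains a loop that computes some statistics for the continuous variables: standard deviation, mean, maximum, minimum, first and third quartiles, and the distance between the maximum and the third quartile.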

In [None]:
desv_std = []
for col in df.select_dtypes(["float64"]).columns:
    desv_std.append(
        {
            "Feature": col,
            "DesvStd": df[col].std(),
            "Mean": df[col].mean(),
            "Max": int(df[col].max()),
            "Min": df[col].min(),
            "Q_1": df[col].quantile(0.25),
            "Q_3": df[col].quantile(0.75),
            "Dif Max-Q_3": int(
                df[col].max() - df[col].quantile(0.75)
            ),
        }
    )

df_desv_std = (
    pd.DataFrame(desv_std)
    .sort_values(by="DesvStd", ascending=False)
    .reset_index(drop=True)
)
df_desv_std

- **median_house_value:** Has the highest standard deviation.
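
A similar table can be built from `DataFrame.describe()` instead of a manual loop. The sketch below is only an alternative, not the notebook's method: `spread_summary` and the `max_minus_q3` column name are hypothetical, and the toy values are made up.

```python
import pandas as pd


def spread_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize numeric columns, sorted by standard deviation (descending)."""
    stats = df.select_dtypes("number").describe().T
    # Distance between the maximum and the third quartile.
    stats["max_minus_q3"] = stats["max"] - stats["75%"]
    columns = ["std", "mean", "max", "min", "25%", "75%", "max_minus_q3"]
    return stats[columns].sort_values("std", ascending=False)


# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame(
    {
        "median_income": [1.0, 2.0, 3.0, 4.0, 5.0],
        "median_house_value": [100000.0, 150000.0, 200000.0, 250000.0, 500000.0],
    }
)
print(spread_summary(toy))
```

A large `max_minus_q3` relative to the interquartile range hints at a long right tail, which is what the boxplots show visually.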


#### Heatmap

Correlation is a metric of how strongly two features are related. It is a value between -1 and 1. A value close to 1 means a very strong, positive (direct) relationship (example: more powerful cars tend to use more fuel). On the other hand, a value close to -1 means a very strong, negative (inverse) relationship (example: spending more hours at work means fewer hours of sleep). Finally, a value close to 0 means there is little or no linear relationship between the features.
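
As a small worked example, the sketch below computes the Pearson correlation coefficient by hand on made-up numbers. `pearson_r` is a hypothetical helper written for illustration; pandas' `corr()` uses the Pearson method by default.

```python
import numpy as np


def pearson_r(x, y) -> float:
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


hours_worked = [6, 8, 10, 12]
hours_slept = [9, 8, 7, 6]      # drops steadily as work increases
offsets = [-2, -1, 0, 1, 2]
squares = [4, 1, 0, 1, 4]       # depends on offsets, but not linearly

print(pearson_r(hours_worked, hours_worked))  # 1.0
print(pearson_r(hours_worked, hours_slept))   # -1.0
print(pearson_r(offsets, squares))            # 0.0
```

The last case shows why a value near 0 only rules out a linear relationship: `squares` is fully determined by `offsets`, yet the coefficient is 0.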

Below you can see a plot called a heatmap, which shows the correlations between the features of the loaded data set.

In [None]:
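f, ax = plt.subplots(figsize=(10, 8))
# Correlate numeric columns only; non-numeric columns such as
# ocean_proximity make df.corr() fail on recent pandas versions
corr = df.select_dtypes('number').corr()
sns.heatmap(corr,
            # np.bool was removed from NumPy; the builtin bool is equivalent.
            # An all-False mask hides nothing, so every cell is drawn.
            mask=np.zeros_like(corr, dtype=bool),
            cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True,
            ax=ax);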

#### Interpreting The Heatmap
* total_rooms and total_bedrooms are strongly related to population and households. This is expected because more people, in general, means more places to live.

* median_income is related to median_house_value. This is also expected (the richer the population, the higher the prices).
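
To read the same information as numbers rather than colors, the correlations with a single target column can be ranked. This is a minimal sketch: `correlations_with` is a hypothetical helper, and the toy frame uses made-up values.

```python
import pandas as pd


def correlations_with(df: pd.DataFrame, target: str) -> pd.Series:
    """Correlation of each numeric column with target, strongest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    # Order by absolute value so strong negative relations rank high too.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame(
    {
        "median_income": [1.0, 2.0, 3.0, 4.0, 5.0],
        "housing_median_age": [30.0, 10.0, 50.0, 20.0, 40.0],
        "median_house_value": [100.0, 210.0, 290.0, 405.0, 500.0],
        "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
    }
)
print(correlations_with(toy, "median_house_value"))
```

In the notebook, `correlations_with(df, "median_house_value")` would show which features move most closely with house prices.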

#### Interactive Map

The aim of the next plot is to show the block groups on a map, using their longitude and latitude.

**NOTE:** the plot contains only the first 1000 rows, as an example.

In [None]:
m = folium.Map(location=[20, 0], tiles="OpenStreetMap", zoom_start=2)
# Add one marker per block group (first 1000 rows only)
for row in df.head(1000).itertuples():
    folium.Marker(location=[row.latitude, row.longitude]).add_to(m)

sw = df[['latitude', 'longitude']].min().values.tolist()
ne = df[['latitude', 'longitude']].max().values.tolist()

m.fit_bounds([sw, ne]) 
m
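
If the rows in the file happen to be ordered geographically, the first 1000 rows may cover only a small part of the state. One possible alternative is to plot a random sample instead. The block below is a sketch under that assumption: `sample_for_map` and `map_bounds` are hypothetical helpers, and the coordinates are made up.

```python
import pandas as pd


def sample_for_map(df: pd.DataFrame, n: int = 1000, seed: int = 42) -> pd.DataFrame:
    """Pick up to n random rows so markers spread over the whole data set."""
    return df.sample(n=min(n, len(df)), random_state=seed)


def map_bounds(points: pd.DataFrame) -> list:
    """Return [[south, west], [north, east]] for the given points."""
    coords = points[["latitude", "longitude"]]
    return [coords.min().tolist(), coords.max().tolist()]


# Toy coordinates standing in for the housing data (values are made up).
toy = pd.DataFrame(
    {
        "latitude": [32.5, 34.0, 37.8, 41.9],
        "longitude": [-117.0, -118.2, -122.4, -124.3],
    }
)
print(len(sample_for_map(toy, n=3)))
print(map_bounds(toy))
```

The markers could then be drawn from `sample_for_map(df)` rather than `df.head(1000)`, and the result of `map_bounds` has the `[southwest, northeast]` shape that the notebook already passes to `m.fit_bounds`.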

### Next Steps

The next notebook will include some feature engineering and an ML model.