<a href="https://colab.research.google.com/github/ragavkumar/Python_Viz_Challenge/blob/main/Python_Visualizations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What makes buying a home in California so expensive?
- *Throughout* this notebook we will seek to answer that question using **exploratory** data analysis techniques and visualizations in Python.
- Anytime you see a line surrounded by triple asterisks, `***LIKE THIS***`, that is a line of code that you will need to replace or edit.
- Have fun and good luck coding!

> To execute a line or block of code, click the "Play" button on its left side or use the keyboard shortcut "Shift + Enter".
> Once the code block has been executed, the empty brackets next to it will show a number.

In [None]:
x = 'Hello, World!'
print(x)

Hello, World!


___
### Import the California Housing dataset.

Keep in mind:
*   One block of homes per row
*   Values are from 1990, not 2019!




In [1]:
# Libraries we need
import os
import requests

def reqFile(url, file_path):
    # Download the CSV file at `url` and save it to `file_path`
    resp = requests.get(url)
    resp.raise_for_status()  # Stop early if the download failed
    with open(file_path, 'wb') as output:
        output.write(resp.content)


In [2]:
# Source of data
url = 'https://github.com/ageron/handson-ml/raw/master/datasets/housing/housing.csv'

# Define a name for the file to be downloaded
file_name = 'housing.csv'

# Get the current working directory to get the full path to 'file_name'
file_path = os.path.join(os.getcwd(), file_name)

# Request the file
reqFile(url, file_path)

In [3]:
import pandas as pd

housing = pd.read_csv(file_path)

### Quick EDA

Use the `.head()` method to look at the first 5 rows of the data.


In [4]:
housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


Use `.shape` to return the size of your data set (rows, columns).

In [5]:
housing.shape

(20640, 10)

Use the `.describe()` method to see summary statistics for the numerical columns in the data set.

In [6]:
housing.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,20640.0,20640.0,20640.0,20640.0,20433.0,20640.0,20640.0,20640.0,20640.0
mean,-119.569704,35.631861,28.639486,2635.763081,537.870553,1425.476744,499.53968,3.870671,206855.816909
std,2.003532,2.135952,12.585558,2181.615252,421.38507,1132.462122,382.329753,1.899822,115395.615874
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.8,33.93,18.0,1447.75,296.0,787.0,280.0,2.5634,119600.0
50%,-118.49,34.26,29.0,2127.0,435.0,1166.0,409.0,3.5348,179700.0
75%,-118.01,37.71,37.0,3148.0,647.0,1725.0,605.0,4.74325,264725.0
max,-114.31,41.95,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0


___
### Let's start using visualizations!

# Import packages that we need to explore and visualize our data.

In [None]:
# Import the appropriate packages:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the default style:
# ('seaborn-darkgrid' was renamed in matplotlib 3.6, so we set the style through seaborn instead)
sns.set_style('darkgrid')

#### Visualizations for Ourselves
Let's start by plotting some basic graphs to answer questions about data.
We don't need to worry too much about making these aesthetically pleasing.

- What is the distribution of house values in our data set?
> To answer this, we'll use a **histogram**:

In [None]:
housing.head()



- I know I said we don't need to worry _too much_ about aesthetics, but if you saw the above histogram without the associated code, you wouldn't know what it's conveying.
> Let's add some labels to fix that:

# What Is The Distribution Of House Values?

In [None]:
# First we create our plot
***LIVE CODE***

# We can add labels
plt.title('Distribution of Median House Values')
plt.xlabel('Median House Value')
plt.ylabel('Number of Blocks')  # Each row is one block of homes

# We can also export our vis to a file (supports 'png', 'svg', 'pdf', etc)
plt.savefig('dist_house_values.svg', format = 'svg', transparent=True)
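
One possible way to draw the histogram is matplotlib's `hist`. The sketch below is not the notebook's official answer: it runs on a small made-up `toy` DataFrame (an assumption, so it works without the download), and the bin count of 5 is arbitrary.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for `housing`: eight made-up block values.
toy = pd.DataFrame({"median_house_value": [95000, 120000, 150000, 180000,
                                           210000, 260000, 340000, 500001]})

fig, ax = plt.subplots()
# hist returns the count per bin, the bin edges, and the drawn bars.
counts, edges, bars = ax.hist(toy["median_house_value"], bins=5)

ax.set_title("Distribution of Median House Values")
ax.set_xlabel("Median House Value")
ax.set_ylabel("Number of Blocks")
```

Swapping `toy` for `housing` and raising `bins` should give a finer-grained view of the real distribution.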

- What about using our `ocean_proximity` column to quantify our house locations? How close to (or far from) the ocean are all of our houses located?
> We'll use `value_counts` and a **bar chart** to answer this:

In [None]:
housing.head()

In [None]:
# Generating the values we are using for the following plot:
housing['ocean_proximity'].value_counts()

# How Close To The Ocean Are The Houses?

In [None]:
# Bar chart for those values here:
***LIVE CODE***

# Labels
plt.title('Number of Blocks By Ocean Proximity')
plt.xlabel('Ocean Proximity')
plt.ylabel('Number of Blocks');
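
As a rough illustration of how `value_counts` feeds a bar chart, here is a sketch on a made-up `toy` frame (an assumption; the category names simply mirror those in `ocean_proximity`). It shows one way to do it, not necessarily the approach used in the live session.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for `housing`: six made-up rows.
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND",
                                        "<1H OCEAN", "INLAND", "NEAR BAY"]})

# value_counts tallies each category, most frequent first.
proximity_counts = toy["ocean_proximity"].value_counts()

# Series.plot draws one bar per category on the axes we pass in.
fig, ax = plt.subplots()
proximity_counts.plot(kind="bar", ax=ax)
ax.set_xlabel("Ocean Proximity")
ax.set_ylabel("Number of Blocks")
```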

- While plotting the value counts for `ocean_proximity` is nice for some initial EDA, it doesn't get us much closer to answering our big overall question. Let's build on it and ask: what is the average median house value for each category in `ocean_proximity`?
> We'll use pandas `groupby` and a **bar chart** to answer this:

In [None]:
housing.head()

# How Does Ocean Proximity Relate To House Value?

In [None]:
# Generating the values for the following plot using groupby:
ocean_prox_house_val = housing.groupby(
    'ocean_proximity')['median_house_value'].mean().sort_values(ascending=False)

ocean_prox_house_val

In [None]:
# Use our ocean_prox_house_val variable to plot those values as a bar chart:
ocean_prox_house_val.plot(***LIVE CODE***)

# Labels
plt.title('House Value by Ocean Proximity') 
plt.xlabel('Ocean Proximity') 
plt.ylabel('Average Median House Value'); 

> _From this chart, it seems that blocks located further inland have a lower average median house value._
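
To make the split, average, and sort steps of `groupby` concrete, here is a minimal sketch on a made-up `toy` frame (an assumption; the numbers are invented and chosen so the group means are easy to check by hand).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for `housing`: five made-up rows.
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "ISLAND"],
    "median_house_value": [100000, 140000, 250000, 270000, 380000],
})

# Split rows by category, average each group, then sort high to low.
mean_value = (toy.groupby("ocean_proximity")["median_house_value"]
                 .mean()
                 .sort_values(ascending=False))

fig, ax = plt.subplots()
mean_value.plot(kind="bar", ax=ax)
ax.set_ylabel("Average Median House Value")
```

Here the two `INLAND` rows average to 120000, so that bar ends up last after sorting.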

In [None]:
housing.head()



# Challenge 1

In [None]:
# CHALLENGE 1
# Use groupby and create another bar chart to show how a different continuous variable is related to ocean proximity:
# Don't forget to replace any relevant labels and titles!
***LIVE CODE*** = housing.groupby(
    'ocean_proximity')['***LIVE CODE***'].mean().sort_values(ascending=False)

***LIVE CODE***.plot(kind = 'bar', color = 'maroon')

# Labels
plt.title('***LIVE CODE***') 
plt.xlabel('***LIVE CODE***') 
plt.ylabel('***LIVE CODE***'); 

# Is There A Relationship Between Income And House Value?


- Maybe we want to know if other aspects of our data are related in any meaningful way.

- We might suspect that higher income goes along with a more expensive home, so we could ask the following: are `median_income` and `median_house_value` positively correlated?

- A **scatter plot** will help illuminate that for us:


In [None]:
housing.head()

In [None]:
# Median income vs house value scatterplot
plt.scatter(***LIVE CODE***)
plt.title('Median House Value vs. Median Income')
plt.xlabel('Median Income (tens of thousands of USD)')
plt.ylabel('Median House Value');

> _It looks like there is a positive correlation between income and house value._
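
A scatter plot shows the trend visually; a correlation coefficient puts a number on it. The sketch below uses a made-up `toy` frame (an assumption) and pandas' `Series.corr`, which computes the Pearson coefficient by default.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for `housing`: five made-up blocks.
toy = pd.DataFrame({
    "median_income": [1.5, 2.5, 3.5, 5.0, 8.0],
    "median_house_value": [90000, 140000, 180000, 260000, 450000],
})

fig, ax = plt.subplots()
points = ax.scatter(toy["median_income"], toy["median_house_value"])
ax.set_xlabel("Median Income (tens of thousands of USD)")
ax.set_ylabel("Median House Value")

# Pearson r: +1 is a perfect rising line, 0 means no linear trend.
r = toy["median_income"].corr(toy["median_house_value"])
print(round(r, 3))
```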

# Challenge 2

In [None]:
housing.head()



In [None]:
# CHALLENGE 2
# Use a scatter plot to show a different relationship between two continuous variables:
# Don't forget to replace any relevant labels and titles!
plt.scatter(housing['***LIVE CODE***'], housing['***LIVE CODE***'])
plt.title('***LIVE CODE***') 
plt.xlabel('***LIVE CODE***')
plt.ylabel('***LIVE CODE***');

# What are the Correlations Between All Continuous Variables?

- What if we don't want to look at individual scatterplots one at a time? Is there a way to look at multiple correlations at once?
> Yes! Pandas has a very helpful `.corr()` method that computes the correlation between every pair of numeric columns.

In [None]:
# Generating the values to visualize:
# (You can also check for how strong the correlation was for your scatter plot here!)
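# numeric_only=True skips the text column `ocean_proximity` (needs pandas 1.5 or newer)
housing.corr(numeric_only=True)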
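
`.corr()` is only defined for numeric columns. In pandas 2.0 and later, calling it on a frame that still holds a text column such as `ocean_proximity` raises an error unless the non-numeric columns are excluded. One way to exclude them, sketched here on a made-up `toy` frame (an assumption), is `select_dtypes`:

```python
import pandas as pd

# Hypothetical stand-in for `housing`: two numeric columns and one text column.
toy = pd.DataFrame({
    "median_income": [1.5, 2.5, 3.5, 5.0, 8.0],
    "median_house_value": [90000, 140000, 180000, 260000, 450000],
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "ISLAND"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr_matrix = toy.select_dtypes(include="number").corr()
print(corr_matrix.round(2))
```

The result is a square table with one row and one column per numeric variable, which is the shape a heatmap expects.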

> We can visualize the results of this correlation matrix using Seaborn's **heatmap** visualization:

In [None]:
# Visualizing our correlation matrix with a heatmap:
plt.figure(figsize=(8,8))
sns.heatmap(***LIVE CODE***,
            linewidths = 0.25,
            square = True,
            cmap = 'seismic',
            linecolor = 'black',
            annot= True)

plt.title('Correlation Matrix', fontsize = 30);

# Final Visualization

We've answered a lot of questions about our data, and have generated some very useful visualizations to help answer those questions. However, both aesthetically and functionally, the ones that we've created have fallen more into the "visualizations for ourselves" bucket.

Now that we have a deeper understanding of our dataset, let's switch gears and create one final visualization that is presentation-ready and aims to answer our big question that we started with: **What makes buying a home in California expensive?** We will want this one to display multiple aspects of our dataset at once and embody the use of visualization to tell a story about our data.

- Since we have `longitude` and `latitude` data, we can use a **scatter plot** to create a map of our data and then continually experiment with new parameters to get what we want:
> First change the size of the figure, then set the `alpha` (opacity) of our points to 0.4 so that areas where many points overlap show up darker.

In [None]:
# Plotting a scatter plot of longitude and latitude:

housing.plot(kind = 'scatter',
             x = 'longitude',
             y = 'latitude',
             figsize = (7,7),
             alpha = .4);

- We know that we'll want to incorporate median income information into our viz, so let's do that here:
> Do this by making the `s` (size) of each point proportional to the median income of the associated block and then adding a label so our audience knows what the size conveys.

In [None]:
# Changing our point size:

housing.plot(kind = 'scatter',
             x = 'longitude',
             y = 'latitude',
             figsize = (7,7),
             alpha = 0.4,
             s = housing['median_income']*5,
             label = 'median income');

- In our last piece of data artistry, we'll incorporate the actual median house values:
> Do this by setting the `c` (color) of each point from the median house value of the associated block, mapped onto the "jet" `cmap` (colormap), and then adding a `colorbar` so our audience knows the range of those values.

Now we have a _BEAUTIFUL_ piece of storytelling with data.

In [None]:
# Building out our final viz:
housing.plot(kind = 'scatter',
             x = 'longitude',
             y = 'latitude',
             figsize = (7,7),
             alpha = 0.4,
             s = housing['median_income']*5,
             label = 'median income',
             c = ***LIVE CODE***,
             cmap = ***LIVE CODE***,
             colorbar = ***LIVE CODE***)

plt.title('Median House Value and Median Income by Location');
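
To show how `s`, `c`, `cmap`, and `colorbar` fit together, here is a sketch on a made-up `toy` frame (an assumption; the coordinates and values are invented). Passing a column name to `c` is one option pandas supports; it is offered as an illustration rather than the session's exact answer.

```python
import pandas as pd

# Hypothetical stand-in for `housing`: four made-up blocks.
toy = pd.DataFrame({
    "longitude": [-122.2, -121.5, -118.3, -117.1],
    "latitude": [37.8, 38.6, 34.0, 32.7],
    "median_income": [8.3, 3.1, 5.6, 4.2],
    "median_house_value": [452600, 120000, 341300, 210000],
})

ax = toy.plot(kind="scatter",
              x="longitude",
              y="latitude",
              figsize=(7, 7),
              alpha=0.4,
              s=toy["median_income"] * 5,   # marker size grows with income
              c="median_house_value",       # column that drives the color
              cmap="jet",                   # colormap named in the text
              colorbar=True)                # adds a scale for the colors
ax.set_title("Median House Value and Income by Location")
```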

#### To wrap it all up, our plot shows that house values are highest in areas of concentrated wealth that are also close to the coast and to high-density urban areas like San Francisco and Los Angeles.

#### Challenge
For added practice and to improve this last visualization, think about what other factors you could plot to continue answering the big question: "What makes buying a home in California so expensive?"

Suggestions of things to change/add:
- Change the column used to show the size (s) of the points (you might have to multiply or divide by a different value, and make sure to change the label too)
- Change the [color map](https://matplotlib.org/tutorials/colors/colormaps.html) to better convey the information
- Change the figure size (which number corresponds to width?)
- Change the alpha (values between 0 and 1)
- Look through the [matplotlib scatter](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html) docs to see what other parameters you can add/change to improve the plot (can you remove the axes?)
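
A few of these suggestions are sketched together below, using matplotlib directly on a made-up `toy` frame (an assumption). The choice of `population` for size, the divisor of 10, and the `viridis` colormap are arbitrary picks for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for `housing`: four made-up blocks.
toy = pd.DataFrame({
    "longitude": [-122.2, -121.5, -118.3, -117.1],
    "latitude": [37.8, 38.6, 34.0, 32.7],
    "population": [322, 2401, 1467, 558],
    "median_house_value": [452600, 120000, 341300, 210000],
})

# figsize is (width, height) in inches.
fig, ax = plt.subplots(figsize=(10, 6))
points = ax.scatter(toy["longitude"], toy["latitude"],
                    s=toy["population"] / 10,   # rescaled so markers stay readable
                    c=toy["median_house_value"],
                    cmap="viridis",             # a perceptually uniform colormap
                    alpha=0.6,
                    label="population")
fig.colorbar(points, ax=ax, label="Median House Value")
ax.legend()
ax.set_axis_off()   # hides the axis lines, ticks, and labels
```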

# Keep Learning with Thinkful
If you enjoyed today's session and want to take a deeper dive into many of the topics that we covered today like Pandas, SQL, predictive modeling, visualizing your data, and so much more, we'd love to have you join us again!
- Check out more of our webinars at [Thinkful Webinars](https://www.thinkful.com/webinars/)
- Learn more about the [Data Science Flex Course](https://www.thinkful.com/bootcamp/data-science/flexible/)