# Data Visualization Problem Set


In this problem set, you will make different plots from the same `housing` dataset you used in the Data Visualization chapter. Every question that asks you to generate a plot includes a list of 'requirements' that your plot must meet, so please pay close attention to the description of each plot.

This problem set also requires you to build very specific subsets and grouped tables from the `housing` dataframe. You will therefore need to recall a fair amount from pandas chapters one and two: mainly subsetting by multiple string values, grouping with `groupby()`, and choosing value columns with `aggregate()`.
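As a refresher on those three steps, here is a minimal sketch on a made-up `toy` dataframe. The column names `REGION`, `SALES`, and `RETURNS` and all of the values are invented for illustration; they are not part of `housing`, and the chapters may have shown slightly different syntax.

```python
import pandas as pd

# Made-up data for illustration only; this is not the housing dataset
toy = pd.DataFrame({
    'REGION': ['North', 'North', 'South', 'South', 'East', 'East'],
    'YEAR': [2020, 2021, 2020, 2021, 2020, 2021],
    'SALES': [10, 20, 30, 40, 50, 60],
    'RETURNS': [1, 2, 3, 4, 5, 6],
})

# Subset by multiple string values: '|' means "or" inside the pattern
subset = toy[toy['REGION'].str.contains('North|South')]

# Group the subset, then name the value columns and the function to apply
summary = subset.groupby('REGION').aggregate({'SALES': 'sum', 'RETURNS': 'sum'})

# Transposing swaps rows and columns
transposed = summary.transpose()

print(summary)
print(transposed)
```

In `summary`, each region is a row and each value column is a column; in `transposed`, the value columns become the rows.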

After every question in the problem set, please remember to create a new code cell for your answer/code. 

In [1]:
# This code cell appears in every one of our chapters in Jupyter Notebook
# The setting displays the output of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# load packages
import pandas as pd
import matplotlib.pyplot as plt

## Programming 

1. Read in the `housing.csv` dataset that you used in the Data Visualization chapter, and name the object `housing`. 

2. Assign a new dataframe named `df` that takes the following data from the `housing` dataframe:
   - Subset `housing` for just two states: 'Georgia' and 'Michigan'.
   - From that subset, group the data by state.
   - From that grouping, aggregate all of the `BUILT_0000` variables, from `BUILT_2020` to `BUILT_1939`. The aggregate function should be 'sum'.
You should make this dataframe in a single line of code.
3. In the same code cell, display `df` and display a [transposed](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html) `df`.

4. Create a bar plot from the transposed `df`. Make sure that the plot has the following features:
   - The legend _only_ shows the names of the two states. In other words, without the variable name `STATE` in the legend.
   - Change the y-axis numeric notation from scientific to plain.
   - Add a title describing this plot.
   - Rotate the x-axis text 45 degrees clockwise.
   - Label both y and x axes.
   - Change the values of the x-axis ticks to just the decades (e.g., 2020, 2010, 2000, ...).
   - Extra 0.5 point: specify `color=('crimson','navy')`. We did not teach this argument in the chapter. Find the right place for this line such that the bars for Georgia are crimson, and the bars for Michigan are navy.

5. The previous plot's dates on the x-axis are in reverse order. Fix them so the values go in chronological order from left to right. _Hint: It'll be easier to reverse the decade order in the dataframe than with pyplot._

6. Make another bar plot on a grouped table for the ten states with the lowest average values of 'median rent'. This time you do not need to assign a new object. In one line create a new data table that has the following characteristics:
   - The value of `YEAR` is greater than 2015.
   - Grouped by the variable `STATE`.
   - The aggregate function takes the mean of `MED_RENT`.
   - Sort the values in ascending order, and pick the first ten observations.

    In the very same line of code, attach the `.plot()` function to make a bar chart. Then add the following features to the plot:
   - The bars should be green.
   - Add a title that says 'Ten Lowest Median Rents in the USA - 2016-2022 Average'.
   - Remove the legend entirely.
   - Label the y-axis 'Rent in dollars'. Remove the x-axis label.

7. Visualize the trends for median home value `MED_VALUE` from 2015 to 2022 in Arkansas, Ohio, Puerto Rico, North Dakota, and South Dakota. To do this, create a new `df2` dataframe object that takes the following data from `housing`:
   - Filters for values of `STATE` containing the five states mentioned, __and__ `YEAR` is greater than 2014.
   - Group the subset by `YEAR` and `STATE`.
   - Aggregate the mean value of `MED_VALUE`.
  
    This should be in a single line of code. _Do not forget to unstack the table._

    Make a line plot from `df2` that has the following visual features:
   - Add a descriptive title to this plot.
   - The legend box should contain only state names.
   - Label the y-axis 'Median home value in dollars'. Remove the x-axis label.
   - Make the line style dotted, and the line width 5.
   - Use the 'Accent' colormap.
   - Annotate the plot with an arrow pointing at the top of the 'Iowa' line where it intersects with the year '2020'. Write 'No data collected 2020'.

8. Create a scatter plot to investigate the relationship between total housing units `TOT_UNITS` available and the number of vacant units `VACANT`. Subset the `housing` dataframe for 2022 and plot a scatterplot with the following parameters:
    - `TOT_UNITS` should be $x$, and `VACANT` should be $y$
    - Title your plot 'Total Housing Units and Vacant Units- 2022'
    - Scale the plot axes to zoom in past the outlier points. Use the following coordinates: -100000,200000 on the x-axis; -10000, 200000 on the y-axis.
    - Change the x-axis values from scientific to plain, and rotate them 45 degrees.
    - Make the dot color red with a transparency of 50%.
    - Label your x and y-axis 'Total housing units' and 'Total vacant units'. 
    
9. In the same cell, repeat the plot but for the year 2010. Change the title accordingly, and change the dot color to blue.

10. Create a custom function named `custom_mv_line_plot` that re-creates the plot from question seven with the same parameters, except that the user can provide different states and a different starting year. The function should have three arguments:
    - An argument for the `housing` data, named `data`
    - An argument for the list of states the user can provide to the function, named `states`
    - An argument for the year that the user wishes to start tracking median home values, named `starting`

    The function should subset `data` by the `states` and the `starting` year. _Hint:_ look for a lesson on joining strings in a previous chapter if you get stuck on what to place in `.str.contains()`.

    Do not annotate the plot for the year when no data was collected. In the end, you should call `custom_mv_line_plot` twice with the following two sets of arguments:
    - `custom_mv_line_plot(data=housing, states=['California', 'Arizona', 'Washington'], starting=2014)`
    - `custom_mv_line_plot(data=housing, states=['Wyoming', 'Hawaii', 'Florida', 'Vermont', 'Idaho'], starting=2010)`
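The hint about joining strings in question ten and the reminder to unstack in question seven can be illustrated on a made-up table. This is only a sketch under stated assumptions: `toy`, `REGION`, `PRICE`, and `regions` are hypothetical names invented here, and the approach from the chapters may differ.

```python
import pandas as pd

# Made-up data for illustration only; this is not the housing dataset
toy = pd.DataFrame({
    'REGION': ['North', 'North', 'South', 'South', 'East', 'East'],
    'YEAR': [2020, 2021, 2020, 2021, 2020, 2021],
    'PRICE': [100.0, 110.0, 200.0, 220.0, 300.0, 330.0],
})

# A list of strings can be joined into one "or" pattern for .str.contains()
regions = ['North', 'East']
pattern = '|'.join(regions)

# Keep only the rows whose REGION matches any name in the list
subset = toy[toy['REGION'].str.contains(pattern)]

# Group by two variables, aggregate, then unstack so that each
# REGION becomes its own column and each YEAR stays a row
wide = subset.groupby(['YEAR', 'REGION']).aggregate({'PRICE': 'mean'}).unstack()

print(pattern)
print(wide)
```

After unstacking, a line plot draws one line per column, which is why the table needs this wide shape before plotting.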

## Interpretation Questions

11. In question four, the plot shows values for `BUILT_1939` far exceeding those of any other decade. In your own words, explain why that is.

12. Explain why the dots in the scatterplots from question 9 cluster so densely towards the left, and why there are so few observations on the right side of the chart. What does this mean regarding the density of observations, the number of housing units, and what each dot actually represents?

13. It appears that there is a close relationship between total housing units and vacant units, such that the cluster of observations is in a relatively dense diagonal line pointing up and to the right. How do you interpret this relationship, in your own words? Is there an obvious explanation?

## Debugging

14. The code below has bugs that keep it from rendering a box plot. Make a new code cell and fix the code so the plot displays correctly.

In [None]:
housing[housing['STATE'].str.contains('Puerto Rico|Ohio|North Dakota|Arkansas|South Dakota')].plot(kind='box', column='tot_units', by='state')

15. The code below should display a scatterplot of median home values by median rent for the year 2022. There are several bugs in the code that you must fix in a new code cell. 

In [None]:
housing[housing['YEAR']=2022].plot(kind='scater', x='MED_VALUE', y='MED_RENT', color='red', alpha=0.0, rot=45)
plt.title('Median Housing Value by Median Rent - 2022')
plt.ticklabel_format(style='plain', axis='x')
plt.xlabel('Median House Value')
plt.ylabel('Median Rent');