## Week 5 Class activities
This notebook is a starting point for the exercises and activities that we'll do in class.

Before you attempt any of these activities, make sure to watch the video lectures for this week.

### Using git
Let's start with some practice using git. 

I set up a class repository. You should all have full access.

Open GitHub Desktop and go to File > Clone Repository. Then clone `UCLALuskinDataScience/git-practice` to a folder on your computer.

Try adding a new file first. Create a text file on your computer and save it in the `git-practice` folder. Go to GitHub Desktop. You should see your new file.
* Add a commit message
* Click on "commit to main"
* Fetch the origin (in case anyone has updated the repository in the meantime)
* Click on "push origin"

You should now see your file in the [cloud repository](https://github.com/UCLALuskinDataScience/git-practice).

If you get the error "You have divergent branches and need to specify how to reconcile them", run the following command in the Terminal or Command Prompt. ([See background here.](https://github.com/desktop/desktop/issues/14431))

`git config --global pull.rebase false`

Fetch the origin again. Now try editing your neighbor's file, or the code that's in there already. Commit, and push it back to the cloud. 

What happens if you both try to edit it at the same time?

### ADUs and neighborhood-level predictors
Let's continue with the example of ADUs from the video lecture.

We'll add a broader set of predictors at the neighborhood (census tract) level, and see if that improves our predictive performance.

First, let's load in the DataFrame that we saved during the lecture. If you ran the code for the lecture (Part 1 - Data preparation), you should be able to load it in as follows.

In [1]:
import pandas as pd
import geopandas as gpd


parcels = pd.read_pickle('../Lectures/joined_permits.pandas')

# convert to a geodataframe. Same code as from video lecture
parcels = gpd.GeoDataFrame(parcels, 
                    geometry = gpd.points_from_xy(
                        parcels.CENTER_LON, 
                        parcels.CENTER_LAT, crs='EPSG:4326'))

# check it looks OK
parcels.head()

Now let's load in our tract-level data from EnviroScreen. Again, this is the same code as we've used before.

In [2]:
enviroscreen = gpd.read_file('../Lectures/data/CalEnviroScreen/CES4 Final Shapefile.shp')

Now we can add the tract-level attributes to each parcel. (Important: we want to keep the dataset at the parcel level, rather than grouping by tract and getting counts as we did before.)

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Do a spatial join to add the EnviroScreen attributes for the relevant census tract to each parcel.
</div>

In [None]:
# your code here
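
If you are unsure how a point-in-polygon join behaves, here is a minimal sketch on toy data rather than the class datasets. The two square "tracts", the three points, and the column names `parcel_id`, `tract_id` and `score` are all made up for illustration. It assumes a reasonably recent geopandas, where `sjoin` takes a `predicate` argument (older versions call it `op`).

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy "tracts": two adjacent unit squares with a made-up attribute
tracts = gpd.GeoDataFrame(
    {"tract_id": ["A", "B"], "score": [10.0, 20.0]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Toy "parcels": three points, the last one outside both tracts
points = gpd.GeoDataFrame(
    {"parcel_id": [1, 2, 3]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
    crs="EPSG:4326",
)

# Both layers need to be in the same CRS before joining
assert points.crs == tracts.crs

# how="left" keeps one row per point; points that fall outside
# every polygon get NaN in the tract columns
joined = gpd.sjoin(points, tracts, how="left", predicate="within")
print(joined[["parcel_id", "tract_id", "score"]])
```

With the real data, the two layers may not share a CRS, in which case one of them would need reprojecting with `to_crs` first.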

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Estimate a random forests model that takes advantage of the new columns you just joined. Create a new variable, <strong>y_pred</strong>, with your predicted values.
</div>

*Hints:*
- You'll first need to choose which variables you want. Focus on the numeric variables for now - don't worry about creating dummies from the string variables
- Then split your dataset into training and testing portions
- Then estimate your model
- Then make your predictions on the testing dataset

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, ConfusionMatrixDisplay

# your code here
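
The steps in the hints can be sketched as follows. This is only an illustration on synthetic data from `make_classification`, not the parcel dataset: the column names `x1` to `x4` and `has_adu` are hypothetical stand-ins for whichever numeric predictors and outcome column you choose.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the parcel data: 4 numeric predictors, binary outcome
X_arr, y_arr = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
xvars = ["x1", "x2", "x3", "x4"]  # hypothetical predictor names
df = pd.DataFrame(X_arr, columns=xvars)
df["has_adu"] = y_arr  # hypothetical outcome column

# Drop rows with missing values in the columns we use
# (none in this toy data, but common after a spatial join)
df = df.dropna(subset=xvars + ["has_adu"])

# Split into training and testing portions
X_train, X_test, y_train, y_test = train_test_split(
    df[xvars], df["has_adu"], test_size=0.25, random_state=1
)

# Estimate the model on the training portion only
rf = RandomForestClassifier(n_estimators=50, random_state=1)
rf.fit(X_train, y_train)

# Predict on the testing portion
y_pred = rf.predict(X_test)
print(y_pred[:10])
```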

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Plot a confusion matrix, plus any other measures of fit that you think would be useful.
</div>

In [None]:
# your code here
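
As a reminder of how the fit measures relate to each other, here is a small sketch using made-up labels (ten invented observations, not model output). In your own code, `y_test` and `y_pred` would come from your model.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up actual and predicted labels, purely for illustration
y_test = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
# Here: 5 true negatives, 1 false positive, 1 false negative, 3 true positives
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Share of observations predicted correctly
print(accuracy_score(y_test, y_pred))

# Precision, recall and F1 for each class
print(classification_report(y_test, y_pred))
```

To draw the same matrix as a plot, `ConfusionMatrixDisplay.from_predictions(y_test, y_pred)` should work in recent versions of scikit-learn. When one class is rare, as ADUs are likely to be, accuracy alone can look good even if the model misses most of the positive cases, so precision and recall are worth checking too.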

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Which variables are most important to your predictions?
</div>

In [None]:
# your code here
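
One way to look at this is the `feature_importances_` attribute of a fitted random forest. The sketch below uses synthetic data and hypothetical column names (`x1` to `x4`), so the numbers themselves mean nothing; it only shows the mechanics.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical predictor names
X_arr, y = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
xvars = ["x1", "x2", "x3", "x4"]
X = pd.DataFrame(X_arr, columns=xvars)

rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(rf.feature_importances_, index=xvars)
importances = importances.sort_values(ascending=False)
print(importances)
```

These impurity-based importances sum to 1 and say nothing about the direction of an effect. They can also be split across correlated predictors, which is one possible source of counterintuitive rankings. `sklearn.inspection.permutation_importance` is an alternative worth comparing against.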

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> The model that I estimated was pretty counterintuitive in terms of the feature importances. (Maybe you got different results.) Think about how to explain your findings and what you might do to investigate further.
</div>

Add some of your thoughts here.

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Use a neural net to estimate the probability of adding an ADU. How do your predictions compare to the random forests?
</div>

*Hint*: Remember to standardize your variables first. Since you don't have any binary (dummy) variables, you can standardize all of your x variables.

In [None]:
from sklearn import preprocessing
from sklearn.neural_network import MLPClassifier

# your code here
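
A minimal sketch of the standardize-then-fit pattern, again on synthetic data rather than the parcels. The layer size and `max_iter` values are arbitrary choices for illustration, not recommendations.

```python
from sklearn import preprocessing
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 4 numeric predictors, binary outcome
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Fit the scaler on the training data only, then apply it to both portions,
# so no information from the test set leaks into the model
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=1)
mlp.fit(X_train_std, y_train)

# predict_proba returns one column per class; column 1 is P(class 1)
prob_adu = mlp.predict_proba(X_test_std)[:, 1]
print(prob_adu[:5])
```

A random forest classifier also has a `predict_proba` method, so you can compare the two sets of predicted probabilities directly.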

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Experiment with some of the hyperparameters (e.g. layer sizes). How do these affect your results?
</div>

In [None]:
# your code here
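
One simple way to experiment is to loop over a few candidate settings and record a fit measure for each. The sketch below does this for `hidden_layer_sizes` on synthetic data; the three candidate architectures are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, standardized as in the previous exercise
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Try one small layer, one larger layer, and two layers
results = {}
for sizes in [(5,), (20,), (20, 10)]:
    mlp = MLPClassifier(hidden_layer_sizes=sizes, max_iter=1000, random_state=1)
    mlp.fit(X_train_std, y_train)
    results[sizes] = mlp.score(X_test_std, y_test)  # accuracy on the test set

for sizes, acc in results.items():
    print(sizes, round(acc, 3))
```

Picking the best setting by its test-set score means the test set is no longer a clean check on performance. For a more careful search, cross-validation on the training data (for example with `GridSearchCV`) is one option.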

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Map your predictions. You could create a census-tract level variable with the predicted probability of a parcel having an ADU. You could also map the predicted vs actual ADU numbers.</div>
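
A possible starting point is to aggregate the parcel-level predictions to tracts and merge the result back onto the tract polygons. The sketch below uses toy data; the column names `tract_id`, `prob_adu` and `has_adu` and the two square "tracts" are invented for illustration.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Toy parcel-level predictions; tract IDs and probabilities are made up
parcel_preds = pd.DataFrame({
    "tract_id": ["A", "A", "B", "B", "B"],
    "prob_adu": [0.10, 0.30, 0.60, 0.80, 0.70],
    "has_adu": [0, 1, 1, 1, 0],
})

# One row per tract: mean predicted probability, expected number of ADUs
# (the sum of the probabilities), and the actual number of ADUs
tract_summary = parcel_preds.groupby("tract_id").agg(
    mean_prob=("prob_adu", "mean"),
    predicted_adus=("prob_adu", "sum"),
    actual_adus=("has_adu", "sum"),
).reset_index()

# Toy tract polygons to attach the summary to
tracts = gpd.GeoDataFrame(
    {"tract_id": ["A", "B"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Merging onto the GeoDataFrame keeps the geometry, so the result can be mapped
tract_map = tracts.merge(tract_summary, on="tract_id", how="left")
print(tract_map.drop(columns="geometry"))
```

From here, `tract_map.plot(column="mean_prob", legend=True)` would draw a choropleth of the mean predicted probability.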

<div class="alert alert-block alert-info">
<h3>What you should have learned</h3>
<ul>
  <li>Gain more practice with spatial joins.</li>
  <li>Understand how to estimate a random forests model.</li>
  <li>Understand how to interpret the results of machine learning classification models.</li>
</ul>
</div>