<img src="https://drive.google.com/uc?id=1-d7H1l1lJ28_sLcd9Vvh_N-yro7CJZcZ" style="Width:1000px">

# Golden Plains Roadside Biodiversity

This is your first data problem! Remember, "Data Problems" are a little less directed than the skills problems. They are here to encourage you to think critically when dealing with data, and they better reflect the type of problems you will encounter during your assessed coursework at the end of the course. Make sure you understand what you have done in the previous exercises, and apply it here. Also, ***get into the habit of maintaining a clean, working notebook***. This will be a key assessment criterion for your marked coursework later next week, so take this opportunity (and further ones) to learn how to do this. This includes using `markdown` cells for comments and observations, making sure your code runs from top to bottom when using `run all cells` from the menu, and of course, keeping a **clean code** practice. It is also a good idea, once you are done with your work, to put all of your `import` statements at the top of the notebook: this makes it clear what is imported in the entire notebook and lets you focus on your more important code below.

Here is a little bit of information on the data you are given. Golden Plains Shire (Australia) is responsible for managing 1834 kilometres of road reserves. Road reserves are not only used for transport: they also act as service corridors and are used for fire prevention, recreation and, occasionally, agricultural pursuits. Native vegetation on roadsides provides important habitat for flora and fauna and contributes to landscape character.

In 2014, Golden Plains Shire acquired funding through the Victorian Adaptation and Sustainability Partnership (VASP) to undertake Council’s ‘Building Adaptive Capacity on Roadsides’ project. The project was designed to identify significant environmental assets on roadsides, improve roadside management practices and reduce Council’s risk of potential breaches of Federal and State environmental legislation.

The council made this <a href='https://data.gov.au/data/dataset/golden-plains-roadside-biodiversity'>dataset available here</a>.<br>
![plain](https://upload.wikimedia.org/wikipedia/commons/thumb/6/6b/Mount_Conner%2C_August_2003.jpg/375px-Mount_Conner%2C_August_2003.jpg)
<br>

🎯 Today, you will work with a simplified version of this real dataset. The dataset contains a number of biodiversity observations, including one on tree size (`RCACTreesS`). This exercise draws on the data preparation and modelling techniques you have learnt: our goal is to predict `RCACTreesS` via linear regression using the available features, and to obtain a good score.

⚠️ This is a long exercise, which will require you to think about the data. Don't hesitate to plot things. If you need to use an algorithm that takes a `random_state`, such as `train_test_split`, remember to always use the value `42` so your results can be compared to the proposed solution. If you get stuck, ask a TA!

# Part I: Ensuring Generalization and EDA

In this first part, do the following:
1. 👇 Load the data into this notebook as a pandas dataframe named `df`, and display its first 5 rows.
2. Check for and drop duplicates
3. We will use the `RCACTreesS` as our target variable (`y`) and all other columns as our features (`X`).
4. Split the dataset into 80%/20% train/test splits (use a `random_state=42`) to create your `X_train`, `X_test`, `y_train`, `y_test` (see above regarding the `y`).
5. **Using only the `X_train`**, spend some time exploring the dataset, for instance looking at the different columns it contains, its data types, and any missing values. Check for correlations between features, and draw some plots. At the end of this EDA stage, you should have a good idea of what the data is. Try to keep this notebook cleanly organised, using `Markdown` cells to put comments for yourself (and your TAs) about your observations.
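
As a rough illustration of steps 2 to 4, here is a minimal sketch on a made-up toy dataframe. Only the `RCACTreesS` column name comes from the exercise; `feature_a`, `feature_b` and all values are hypothetical placeholders, so treat this as one possible outline rather than the expected solution.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data (values and feature names are invented).
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "feature_b": ["x", "y", "y", "x", "z", "x", "y", "z", "x", "y", "z"],
    "RCACTreesS": [0.5, 1.0, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0],
})

# Count fully duplicated rows, then drop them.
n_duplicates = toy.duplicated().sum()
toy = toy.drop_duplicates()

# Separate the target from the features.
y = toy["RCACTreesS"]
X = toy.drop(columns="RCACTreesS")

# 80%/20% split with a fixed random_state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

Splitting before exploring means that anything you learn during EDA comes from the training rows only, which is why step 5 asks you to look at `X_train` alone.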

In [None]:
from nbta.utils import download_data
download_data(id='19qi8xMUaamIAX8KcZproR33c2JQcOAul')

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

# Part II: Missing values and scaling

Now do the following:
1. Drop features with >30% missing values
2. Impute `RoadWidthM`, `PowerlineD` and `Trees` using the most appropriate strategy <details>
    <summary> 💡 Hint </summary>
    <br>
    ℹ️ Look at the datatype of <code>PowerlineD</code> and the distribution of its values using the <code>.unique()</code> method. Although <code>PowerlineD</code> is numeric, it clearly only takes a few discrete values: what would be a logical value to impute? The same applies to <code>Trees</code> and <code>RoadWidthM</code> but for a different reason: they are continuous variables, but there is clearly one value that dominates their distribution, so it makes sense to assume that the <code>nan</code> values represent this most frequent value. You can therefore impute all of these variables at the same time.
</details> 
3. Impute `Locality` and `EVCNotes` <details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ Clearly <code>Locality</code> refers to the name of the county or region where the data comes from. We could impute the most frequent locality, but this would introduce errors. In this case, the best strategy is simply to replace the <code>nan</code> values with something meaningful such as 'not known'. <code>EVCNotes</code> is somewhat similar: the <code>nan</code> values indicate that no notes exist, so we should replace them with 'no notes'.
</details>
4. Impute `SoilType` and `LandformLS` <details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ These two are tricky. They are both string values, and they both have two classes that are very common. On a real project, a good data scientist would study what those codes mean <a href="http://vro.agriculture.vic.gov.au/dpi/vro/vrosite.nsf/pages/landform_land_systems_rees/$FILE/TECH_56%20ch6.pdf"> by referring to the government publication</a>. In an ideal world we would explore different strategies for imputation (we will see this later in the course). Here, however, we need to decide based on little evidence. Because we have no information, and because there is no clear majority in either the soil or the landform classes, the best option is to impute 'SoilTypeNA' and 'LandFormLSNA' as a new class.
</details>
5. Impute `CanopyCont` <details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ If you do a <code>value_counts()</code> on <code>CanopyCont</code> you will see that it consists of 4 numerical values and 5 categorical values. It is clear that this column has two different encodings for the same concept: how continuous is the canopy? The easiest approach is to transform this into a numerical column by doing the following replacements: 'none'=0, 'sparse'=1, 'patchy'=2, 'continous' or 'c' = 3. You probably want to use a Python dictionary and an <code>apply()</code> function to do that, and remember to cast your values to an <code>int</code> or a <code>float</code>!
</details>
6. Scale all of your numerical features using an appropriate scaler. Check their distribution before deciding on your scaling strategy! <details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ <code>WidthVarie</code> and <code>Powerline</code> are clearly binary variables (0 or 1). They should not be scaled; they could optionally be encoded using a categorical encoder such as <code>OneHotEncoder</code>, but here you can simply leave them as they are. All other numerical features are non-Gaussian, so a <code>RobustScaler</code> is probably the most appropriate.
</details>
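
The imputation ideas in steps 2 to 5 can be sketched on a small invented dataframe. The column names below come from the exercise, but every value is made up, and the helper `encode_canopy` is a hypothetical name introduced only for this illustration. This is one possible approach under those assumptions, not the reference solution.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented values; only the column names come from the exercise.
toy = pd.DataFrame({
    "PowerlineD": [0.0, 0.0, np.nan, 1.0, 0.0],
    "Locality": ["town_a", np.nan, "town_b", "town_a", np.nan],
    "CanopyCont": ["none", "2", "patchy", np.nan, "c"],
})

# Most-frequent imputation for a column dominated by one value.
mode_imputer = SimpleImputer(strategy="most_frequent")
toy[["PowerlineD"]] = mode_imputer.fit_transform(toy[["PowerlineD"]])

# Constant imputation for a text column where a missing value means "unknown".
toy["Locality"] = toy["Locality"].fillna("not known")

# Harmonise the two encodings of CanopyCont into one numeric scale.
canopy_codes = {"none": 0, "sparse": 1, "patchy": 2, "continous": 3, "c": 3}

def encode_canopy(value):
    """Map text labels to codes, cast numeric strings to float, keep NaN."""
    if pd.isna(value):
        return np.nan
    if value in canopy_codes:
        return canopy_codes[value]
    return float(value)

toy["CanopyCont"] = toy["CanopyCont"].apply(encode_canopy).astype(float)
```

Note that this sketch leaves the remaining `nan` in `CanopyCont` untouched: deciding how to fill it is still up to you. On the real data, remember to fit any imputer on `X_train` only.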

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

In [None]:
X_train

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('missing_values',
                         dataset = X_train)
result.write()
print(result.check())

### Testing your scaling
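Run the check below to make sure your scaling worked before proceeding. The check reads a variable named `numerical_columns`, which is not defined elsewhere in this notebook, so you need to define it yourself (presumably as the list of numerical feature names you scaled).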
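
As a reminder of what robust scaling does, here is a minimal sketch on invented data. The column names `width` and `is_binary` are hypothetical, and `numerical_columns` is assumed here to be a plain list of the column names to scale.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Invented data: one skewed column with an outlier, one binary flag.
toy = pd.DataFrame({
    "width": [1.0, 2.0, 3.0, 4.0, 100.0],
    "is_binary": [0, 1, 0, 1, 1],
})

# Only the non-binary numerical column is scaled.
numerical_columns = ["width"]

# RobustScaler subtracts the median and divides by the interquartile range,
# so a single extreme value does not dominate the scaling.
scaler = RobustScaler()
toy[numerical_columns] = scaler.fit_transform(toy[numerical_columns])
```

After scaling, the median of `width` sits at 0 while the outlier remains far from the rest, and `is_binary` is unchanged. On the real data, fit the scaler on `X_train` only and reuse it to transform `X_test`.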

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('scaling',
                         dataset = X_train,
                         features = numerical_columns
)

result.write()
print(result.check())

# Part III: Encoding and Modelling

All that is left to do now is deal with categorical data, and then use this to build a simple model.

# Encoding

👇 Investigate the non-numerical features that require encoding, and apply 'One hot encoding'. To ensure that we do not end up with an explosion of features, we will retain only categorical features with <15 unique values for encoding.

So your task is the following:

1. Identify programmatically all of the categorical features that have <15 unique categories and require 'One Hot encoding'
2. In the dataframe, replace the original features with their encoded version(s). Make sure to drop the original features, as well as the categorical features with 15 or more unique categories, from `X_train`

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

In [None]:
X_train.shape

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('encoding',
                         dataset = X_train)
result.write()
print(result.check())

# Base Modelling

All we need now is to cross-validate a linear regression model with our `X_train` and `y_train` using `cv=5`. Save its score under the variable name `base_model_score`. However, if you do this you will see that we obtain a very low `r2`. This is because not all of the features we have selected are useful (we will talk more about this in a couple of days). So instead, train your model using only the top features whose correlation with your `y_train` is > 0.05. <details><summary>💡 Hint </summary>
    <br>
    ℹ️ If you are unsure how to do this, check the documentation for the <code>corr()</code> function in pandas. Also, you will need to combine the <code>y_train</code> and the <code>X_train</code> in the same pandas object to do that.
</details>
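
A minimal sketch of correlation-based feature selection followed by cross-validation, on synthetic data. The feature names `useful` and `noise`, the target name `target` and the data itself are all invented for the illustration; the 0.05 threshold and `cv=5` come from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic training data: one informative feature and one pure-noise feature.
rng = np.random.default_rng(42)
n_rows = 200
X_train = pd.DataFrame({
    "useful": rng.normal(size=n_rows),
    "noise": rng.normal(size=n_rows),
})
y_train = pd.Series(
    3 * X_train["useful"] + rng.normal(scale=0.1, size=n_rows),
    name="target",
)

# Put features and target in one dataframe to compute correlations with the target.
combined = pd.concat([X_train, y_train], axis=1)
correlations = combined.corr()["target"].drop("target")

# Keep features whose correlation with the target exceeds the threshold.
# This follows the text literally; using correlations.abs() would also keep
# strongly negatively correlated features.
selected = correlations[correlations > 0.05].index.tolist()

# 5-fold cross-validation; for a regressor the default score is r2.
scores = cross_val_score(LinearRegression(), X_train[selected], y_train, cv=5)
mean_score = scores.mean()
```

`cross_val_score` returns one score per fold, so the mean is a natural single number to report.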

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('base_model',
                         score = base_model_score
)

result.write()
print(result.check())

# 🏁