# Problem Set 1 - Python Intro & Data Prep

The main goal of this problem set is to introduce you to working with data in pandas DataFrames and preparing that data for use with a machine learning model. We will also look briefly at the scikit-learn interface and documentation. A core objective of this course is for you to become comfortable reading and understanding Python package documentation, so read the hints carefully and use the provided links to find and learn how to use the built-in functions available in pandas and scikit-learn.

## Importing Packages

The major packages that we will work with in this course are:   

**pandas**: loading, managing, and manipulating data     
*   User Guide: https://pandas.pydata.org/docs/user_guide/index.html   
*   Documentation: https://pandas.pydata.org/docs/reference/index.html  

**numpy**: mathematical calculations and ancillary data management  
*   Documentation: https://numpy.org/doc/stable/    

**matplotlib**: plotting results  
*   User Guide: https://matplotlib.org/stable/users/index.html
*   Documentation: https://matplotlib.org/stable/api/index.html

**scikit-learn**: general machine learning models     
*   User Guide: https://scikit-learn.org/stable/getting_started.html    
*   Documentation: https://scikit-learn.org/stable/    

**pymc3**: Bayesian models
*   User Guide: https://www.pymc.io/projects/docs/en/stable/learn.html
*   Documentation: https://www.pymc.io/projects/docs/en/stable/api.html

<div class="alert alert-info">
    <b>(2 pts)</b> In the code block below, import pandas, matplotlib, and scikit-learn.
</div>
<div class="alert alert-warning">
    Grading Comments: 
</div>

In [None]:
# Enter code here

## Working with Pandas Data Frames
We can pull text file data directly from GitHub into a pandas dataframe using the `read_csv()` function. The code block below shows you how to load the data for this assignment. You will use this same method to load data for most future problem sets and labs. The call to `data.head()` will display the first few rows of the dataframe.

In [None]:
data = pd.read_csv("https://raw.githubusercontent.com/rtculberg/ml_in_eas/main/data/Jakobshavn.csv")
data.head()

You should see that the dataframe contains the following variables:      

LAT - latitude of the data point (WGS84)       
LON - longitude of the data point (WGS84)          
ICE_THICK - ice thickness in meters at each data point       
QUALITY - quality of the ice thickness measurements. 0 means that quality was not assessed, 1 is high quality, 2 is low quality, and 3 indicates the values were filled in from ancillary sources.          
VELOCITY - ice velocity in meters per year at each data point        
SURFACE_ELEV - surface elevation in meters above sea level (WGS84) at each data point       

Use the built-in pandas functions for dataframes to complete the tasks listed below. You can find a list of all of these functions here: https://pandas.pydata.org/docs/reference/frame.html.

<div class="alert alert-info">
    <b>(5 pts)</b> The QUALITY field indicates the assessed quality of the ice thickness measurements. Drop all of the rows from the dataframe where QUALITY is not equal to 1. Plot a 2D spatial map of the data points (longitude vs. latitude) using the matplotlib <code>scatter</code> function and color each datapoint by the associated ice thickness. Make sure to include axis labels and a colorbar for your plot.    
</div>
<div class="alert alert-warning">
    Grading Comments: 
</div>

**Hint**: see the section on "Boolean Indexing" in the pandas user guide (https://pandas.pydata.org/docs/user_guide/indexing.html) and the documentation for the `drop()` function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html).      
    
**Hint**: see the Matplotlib quick start guide for some basic examples of how to use this package for making figures (https://matplotlib.org/stable/users/explain/quick_start.html).
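
The two hints above can be illustrated on a small made-up table. This is only a sketch under stated assumptions: the dataframe `toy` and its values are hypothetical (the column names simply mirror the assignment), and it shows one possible way to combine boolean indexing, `drop()`, and a colored scatter plot rather than the expected solution.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; column names mirror the assignment, values are made up
toy = pd.DataFrame({
    "LON": [-49.5, -49.0, -48.5, -48.0],
    "LAT": [69.0, 69.1, 69.2, 69.3],
    "ICE_THICK": [120.0, 850.0, 1400.0, 300.0],
    "QUALITY": [1, 2, 1, 0],
})

# Boolean indexing: the comparison yields a True/False Series that selects rows
high_quality = toy[toy["QUALITY"] == 1]

# Equivalent approach with drop(): remove the index labels of the unwanted rows
dropped = toy.drop(toy[toy["QUALITY"] != 1].index)

# Scatter plot colored by a third variable, with axis labels and a colorbar
fig, ax = plt.subplots()
points = ax.scatter(high_quality["LON"], high_quality["LAT"], c=high_quality["ICE_THICK"])
ax.set_xlabel("Longitude (degrees)")
ax.set_ylabel("Latitude (degrees)")
fig.colorbar(points, ax=ax, label="Ice thickness (m)")
```

Both approaches keep the same rows here; boolean indexing is usually the shorter of the two.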

In [None]:
# Enter code here

<div class="alert alert-info">
    <b>(3 pts)</b> Check if there are any NaN values in the ICE_THICK, SURFACE_ELEV, or VELOCITY columns. If so, remove any row that has at least one NaN value in it from the dataframe.
</div>
<div class="alert alert-warning">
    Grading Comments: 
</div>

**Hint**: see the pandas documentation on the `isna()` and `dropna()` functions (https://pandas.pydata.org/docs/reference/frame.html).
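
As a rough illustration of how `isna()` and `dropna()` behave, here is a sketch on a made-up dataframe. The name `toy` and its values are assumptions for demonstration only; the real data will have different counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values
toy = pd.DataFrame({
    "ICE_THICK": [100.0, np.nan, 300.0, 400.0],
    "SURFACE_ELEV": [50.0, 60.0, np.nan, 80.0],
    "VELOCITY": [10.0, 20.0, 30.0, 40.0],
})

# isna() marks each missing cell as True; summing counts the NaNs per column
nan_counts = toy.isna().sum()
print(nan_counts)

# dropna() returns a new dataframe without rows that have a NaN in the listed columns
cleaned = toy.dropna(subset=["ICE_THICK", "SURFACE_ELEV", "VELOCITY"])
print(cleaned)
```

Note that `dropna()` returns a new dataframe by default, so the result needs to be assigned to a variable to be kept.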

In [None]:
# Enter code here

<div class="alert alert-info">
    <b>(3 pts)</b> Now plot a histogram of the cleaned ice thickness.   
</div>
<div class="alert alert-warning">
    Grading Comments: 
</div>

Answer the following question:    
Do you notice any outliers or potential issues with the distribution?

In [None]:
# Enter code here

<div class="alert alert-info">
    <b>(5 pts)</b> Since we are only interested in data on the ice sheet itself, remove rows where the ice thickness is less than ~10 m. Plot a map of the remaining ice thickness measurements. Make sure to include axis labels and a colorbar for your plot.
</div>
<div class="alert alert-warning">
    Grading Comments: 
</div>

In [None]:
# Enter code here

<div class="alert alert-info">
    <b>(3 pts)</b> Now we would like to calculate some statistics about our data, such as the mean ice thickness. At this point, we no longer need the quality flags or the positioning information. Create a new dataframe that only contains the ICE_THICK, SURFACE_ELEV, and VELOCITY columns. Display the first few rows of this new dataframe.
</div>
<div class="alert alert-warning">
    Grading Comments: 
</div>

In [None]:
# Enter code here

<div class="alert alert-info">
    <b>(2 pts)</b> Display the mean value of the data in each column.
</div>
<div class="alert alert-warning">
    Grading Comments: 
</div>

In [None]:
# Enter code here

<div class="alert alert-info">
    <b>(2 pts)</b> Display the unbiased standard deviation of the data in each column.
</div>
<div class="alert alert-warning">
    Grading Comments: 
</div>
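
"Unbiased" here is usually taken to mean the sample standard deviation, which divides by N - 1 rather than N. The sketch below, on a made-up column, contrasts the pandas default (`ddof=1`) with the numpy default (`ddof=0`); the dataframe `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, values made up for illustration
toy = pd.DataFrame({"ICE_THICK": [100.0, 200.0, 300.0, 400.0]})

# pandas defaults to ddof=1, which divides by N - 1 (Bessel's correction)
sample_std = toy.std()

# numpy defaults to ddof=0, which divides by N (the population formula)
population_std = np.std(toy["ICE_THICK"].to_numpy())

print(sample_std["ICE_THICK"], population_std)
```

The two values differ noticeably for small samples and converge as the number of rows grows.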

In [None]:
# Enter code here

<div class="alert alert-info">
    <b>(3 pts)</b> Calculate and plot a visualization of the correlation matrix between the columns in the dataframe.  
</div>
<div class="alert alert-warning">
    Grading Comments: 
</div>

**Hint**: look at the pandas documentation for the `corr()` function and see the second answer to this StackOverflow question about applying styles to dataframes: https://stackoverflow.com/questions/29432629/plot-correlation-matrix-using-pandas
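
To show what `corr()` returns, here is a minimal sketch on a made-up dataframe whose columns are perfectly (anti-)correlated by construction. The name `toy` and its values are assumptions; the styled view is left as a comment because it only renders in a notebook.

```python
import pandas as pd

# Hypothetical toy data: SURFACE_ELEV rises with ICE_THICK, VELOCITY falls with it
toy = pd.DataFrame({
    "ICE_THICK": [100.0, 200.0, 300.0, 400.0],
    "SURFACE_ELEV": [10.0, 20.0, 30.0, 40.0],
    "VELOCITY": [8.0, 6.0, 4.0, 2.0],
})

# corr() returns a square dataframe of pairwise Pearson correlation coefficients
corr_matrix = toy.corr()
print(corr_matrix)

# In a notebook, a styled view colors each cell by its value (the Styler needs Jinja2):
# corr_matrix.style.background_gradient(cmap="coolwarm")
```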

In [None]:
# Enter code here

## Working with Scikit-Learn

You should have noticed that the correlation coefficient between surface elevation and ice thickness is quite high, suggesting a strong linear relationship between the two variables. To get familiar with the scikit-learn package, we will implement a simple linear regression model to predict ice thickness from surface elevation.     
You can find an example of implementing linear regression in scikit-learn here: https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#.   
You can find the documentation for the `LinearRegression` model here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression

Scikit-learn uses an object-oriented programming paradigm. As a result, the general workflow is to create a model object and then train it by calling the `fit` function on the model and passing it your training data. Once you have a trained model, you can call its `predict` function and pass it the test data (or other unseen data) to predict results from those inputs.
The code block below has a few lines of code that will split your data into a training and test set. Add additional code to complete the following tasks:  

<div class="alert alert-info">
    <b>(5 pts)</b> Instantiate a new linear regression model object. <br>        
    <b>(5 pts)</b> Train the model with the training set. <br>          
    <b>(5 pts)</b> Make predictions on the test set. <br>              
    <b>(5 pts)</b> Plot the training data and the test set predictions on a single plot (this should look like a scatter plot of the training data, with a best fit line through it from the test set predictions.) <br>       
    <b>(2 pts)</b> Print the slope of the best fit line from the linear model.
</div>
<div class="alert alert-warning">
    Grading Comments: 
</div>
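
The create, fit, and predict workflow described above can be sketched on synthetic data. Everything here is an assumption made for illustration: the arrays `x` and `y` are made up to follow y = 2x + 1 exactly, and the manual split into the first seven and last three points stands in for a proper train/test split.

```python
import numpy as np
from sklearn import linear_model

# Synthetic, noise-free data following y = 2x + 1 (made up for illustration)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# scikit-learn expects features with shape (n_samples, n_features)
x_2d = x.reshape(-1, 1)

model = linear_model.LinearRegression()  # create the model object
model.fit(x_2d[:7], y[:7])               # train on the first seven points
predictions = model.predict(x_2d[7:])    # predict the three held-out points

# coef_ holds one slope per feature; intercept_ is the constant term
print(model.coef_, model.intercept_)
print(predictions)
```

Because the data are noise-free, the fitted slope and intercept recover the values used to generate them; real measurements will scatter around the fitted line.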

In [None]:
from sklearn import linear_model, model_selection

# Use a random 30% of the data as test data and leave the remaining 70% of the data for training
# Since we want to predict ice thickness from surface elevation using a model of the form
# y = mx + b, data['SURFACE_ELEV'] will be our x variable and data['ICE_THICK'] will be our y
# variable. For reproducibility, we set the random state to 10.
# Note that the regression model expects the x values as a 2D array of shape (n_samples, n_features),
# so rather than passing in the Series data['SURFACE_ELEV'], we pass its values to get an ndarray and
# then reshape the output from a 1D to a 2D array so that the data will be in the expected shape.
x_train, x_test, y_train, y_test = model_selection.train_test_split(data['SURFACE_ELEV'].values, data['ICE_THICK'], test_size=0.3, random_state=10)
x_train = x_train.reshape(-1,1)
x_test = x_test.reshape(-1,1)

# Enter your code here
