# Learning Goals


- Students will understand the importance of ML, be able to verbally articulate that importance, and discuss ways in which ML applies to everyday life.
- Students will be able to explain the role of data in ML, as well as the difference between training, validation, and testing data.
- Students will be able to identify errors in example machine learning code when prompted that there is an error.
- Students will be able to argue why the Train/Val/Test split is important.
- Students will be able to modify existing ML code to perform hyperparameter tuning when such code is clearly explained.


# Important Libraries

The nice thing about Google Colab, other than the free computing hardware, is that it comes with a lot of very useful libraries for machine learning already installed. Let's take a look at some of the most important ones now.



> **NumPy**: NumPy is a Python library for creating and manipulating large numerical datasets through fast array data structures. It is the foundation for a lot of the other libraries we are about to discuss.

> **Matplotlib**: This is a very popular visualization and plotting tool. Many of the plots you see in machine learning content are made with Matplotlib.

> **SciPy**: SciPy is another foundational tool with many useful functions for data analysis and scientific computing. It also includes tools for image processing tasks.

> **Scikit-learn**: This library is built on top of NumPy and SciPy and provides implementations of many popular classical machine learning algorithms.

> **Pandas**: Pandas is another higher-level library, built on top of NumPy, for data analysis and manipulation. If you are trying to inspect your dataset, pandas probably has whatever you're looking for through its DataFrame class.

> **TensorFlow and PyTorch**: TensorFlow and PyTorch are the predominant libraries for building neural networks. In particular, these libraries are foundational to modern deep learning, which we will be getting into later.





# But wait, why do we even need ML?


That's a good question, and it's one we have yet to answer. The short version is that some problems are too complex for us to solve manually. Machine learning is a way for us to search for an optimal solution to a problem.

Let's imagine a simple problem: Suppose that I want you to figure out a student's GPA based on their exam scores. That shouldn't be too hard, right? Scores in your classes relate exactly to your GPA; there is a mathematical function that the school uses to take your scores and turn them into a GPA. Given some time (and maybe access to Google), we should be able to figure out the function that takes us from our input exam scores to our output.

Now, let's complicate the problem. Say that I want to know students' GPA, but I don't have their test scores. Instead, I only have information about their average amount of sleep per night. Does the amount of sleep you get correlate to your grades? Yes, probably. But, it isn't a hard-and-fast rule like the relationship between exam scores and GPA. The situation is more complicated.

If I sat down and spent all day trying to come up with some formula to model this relationship, I *might* be able to come up with something halfway decent. Now, let's consider the situation where I have more input. Let's say I have the average number of hours of sleep, information about parental involvement in the student's learning, what school they attend, which classes they take, whether they take AP/IB classes, and how many extracurricular activities they participate in. That's *a lot* more variables to consider, and the chances that any human can sit down and come up with a function that relates all those inputs to a student's GPA with any sort of accuracy are much slimmer. In situations like this, we employ machine learning.

Machine learning is used when a problem becomes too complex for us to manually solve. Instead of coming up with a function ourselves, we develop clever algorithms that can figure out an optimal function for us.
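To make "letting an algorithm figure out the function" a bit more concrete, here is a minimal sketch. The numbers are made up for illustration (they are not real student data), and the hidden rule `gpa = 2.0 + 0.2 * sleep_hours` is an assumption chosen only so we can check the answer. `np.polyfit` searches for the straight line that best fits the points.

```python
import numpy as np

# Made-up illustration data: hours of sleep per night for six imaginary students.
sleep_hours = np.array([5.0, 6.0, 6.5, 7.0, 8.0, 9.0])

# Hypothetical hidden rule, used only to generate the example GPAs.
gpa = 2.0 + 0.2 * sleep_hours

# Ask NumPy to find the best-fitting straight line (a degree-1 polynomial).
slope, intercept = np.polyfit(sleep_hours, gpa, deg=1)

print("Learned slope:", round(slope, 3))
print("Learned intercept:", round(intercept, 3))
```

Real data is noisier and has many more inputs than this toy example, which is why we reach for more capable algorithms like the random forest used later in this lesson.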

# Now let's do something cool!

You now have a sense of why machine learning is important, and are familiar, in passing, with the most popular machine learning libraries. Now, it's time to use them.

You have probably seen the movie *Titanic*. If you haven't seen it, you've probably heard of it. If you haven't heard of it... well, look it up, I guess. The movie is, of course, based on the real-life sinking of the Titanic. In this exercise, we are going to use machine learning to predict whether passengers aboard the Titanic survived, using information about their age, gender, wealth, etc. (Credit to Kaggle for the dataset we are going to use.)


In [None]:
import pandas as pd

raw_data = pd.read_csv("titanic.csv")  # load the CSV file into a DataFrame
raw_data.head()  # first five rows

raw_data.dtypes  # data type of each column

Let's dig into what that code did. First, we imported the Python module "pandas". As we talked about earlier, pandas is a module that is really good for analyzing and manipulating data. The whole "import pandas *as pd*" business is just creating an alias "pd" for the module "pandas". You are renaming pandas to be pd in your application.

Then, we ran pd.read_csv. This function takes all the data from your file and loads it into the raw_data variable.

Lastly, .head() is a method that acts on the data inside the raw_data variable. It returns the first few rows of data (five by default) in a format that's easy to look at.

Let's take a second to inspect the raw_data variable.

In [56]:
type(raw_data)

pandas.core.frame.DataFrame

The above line of code tells you what type of object the variable raw_data is. Running the code, you will see that raw_data is a DataFrame object. DataFrames are a custom object type provided by pandas that is very useful for data analysis.

Now, if you take a closer look at the data above, you might notice some problems with it. Some columns have a categorical label (such as Survived), some have a numerical value (such as Age), and others have a string value (like Name). Additionally, in some columns, you may have noticed a value of "NaN" ("not a number"). pandas uses NaN to mark a value it could not fill in, most commonly because the data is missing from the original file. When working with machine learning, these types of data have to be treated differently. Two common approaches to dealing with NaNs are to get rid of any entry that has a NaN, or to replace the NaN with a value estimated from the existing data, such as the column's average. Be careful with this dataset, because some of the columns, such as Age, do have NaNs.
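The two approaches to missing values can be sketched as follows. This is only an illustration on a hypothetical five-row column with invented ages, not the Titanic data itself; `dropna` and `fillna` are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset; the ages are invented for illustration.
toy = pd.DataFrame({"Age": [22.0, np.nan, 30.0, np.nan, 38.0]})

# Approach 1: drop every row that has a NaN in "Age".
dropped = toy.dropna(subset=["Age"])

# Approach 2: fill each NaN with the mean of the known ages.
mean_age = toy["Age"].mean()  # NaNs are skipped when computing the mean
filled = toy.fillna({"Age": mean_age})

print(len(dropped))            # 3 rows remain
print(filled["Age"].tolist())  # [22.0, 30.0, 30.0, 30.0, 38.0]
```

Dropping rows is simple but throws data away. Filling in a value keeps every row but adds values that were never actually observed, so which approach is better depends on the dataset.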

Additionally, it is usually important to think about which parts of your data matter before you apply any sort of machine learning algorithm. This is part of a process called "feature engineering". For deep learning, this process is often less important, as deep learning automates much of the feature extraction. However, for shallow learning algorithms, feature engineering is usually a must.

In this application, we will skip ahead a bit. Instead of thoroughly investigating each variable and determining which ones correlate most strongly with the output, we will just use "Sex" and "Pclass" (passenger class, a rough stand-in for wealth), since, from the story of the Titanic, we know that sex and wealth correlate well with survival rate.


In [16]:
raw_data.drop(["PassengerId", "Cabin", "Embarked",
               "Name", "Age", "SibSp", "Parch", "Ticket", "Fare"], axis = 1, inplace = True)

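# inplace = True modifies raw_data directly instead of returning a new DataFrame
# axis = 1 means the labels are column names (axis = 0 would mean row labels)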
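If the `axis` and `inplace` arguments feel cryptic, the following sketch shows an equivalent way to write a column drop. The three-column table is hypothetical and exists only for this demonstration.

```python
import pandas as pd

# Hypothetical three-column table used only for this demonstration.
toy = pd.DataFrame({"Name": ["A", "B"], "Pclass": [3, 1], "Survived": [0, 1]})

# axis=1 tells pandas the labels are column names.
by_axis = toy.drop(["Name"], axis=1)

# The columns= keyword says the same thing more explicitly.
by_keyword = toy.drop(columns=["Name"])

# Neither call used inplace=True, so the original still has all three columns.
print(list(toy.columns))
print(list(by_axis.columns))
```

Without `inplace=True`, `drop` returns a new DataFrame and leaves the original alone, which makes it harder to lose data by accident when you rerun a cell.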


Great, we have dropped all the columns of data that we don't want. Let's look at what we have again.

In [17]:
raw_data.head()

|   | Survived | Pclass | Sex |
|---|---|---|---|
| 0 | 0 | 3 | male |
| 1 | 1 | 1 | female |
| 2 | 1 | 3 | female |
| 3 | 1 | 1 | female |
| 4 | 0 | 3 | male |


As hinted at earlier, our model can't operate on strings (words) such as "male" and "female". We must assign numbers as labels (0 or 1) in place of the string labels currently in the data.

In [18]:
import numpy as np  # so we can use the "where" function
# Read this like: where raw_data['Sex'] equals 'male', use 0. Otherwise, keep the value in raw_data['Sex'].
raw_data['Sex'] = np.where(raw_data['Sex'] == 'male', 0, raw_data['Sex'])
# Same idea: where raw_data['Sex'] equals 'female', use 1.
raw_data['Sex'] = np.where(raw_data['Sex'] == 'female', 1, raw_data['Sex'])
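As an alternative sketch (not the approach used in this lesson), pandas can do the same string-to-number conversion in one step with `Series.map` and a dictionary. The small series below is a hypothetical stand-in for `raw_data['Sex']`.

```python
import pandas as pd

# Hypothetical column standing in for raw_data['Sex'].
sex = pd.Series(["male", "female", "female", "male"])

# Look up each string in the dictionary and return the matching number.
encoded = sex.map({"male": 0, "female": 1})

print(encoded.tolist())  # [0, 1, 1, 0]
```

With `map`, any value that is not in the dictionary becomes NaN, which can help you spot unexpected labels in a column.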

Now, it's time to split up the data for training.

## Training, Validating, and Testing

Supervised machine learning projects typically use a training, validation, and testing dataset. To get these, you take your raw data and split it into three parts. A common choice is for 80% of your raw data to become the training dataset, 10% the validation dataset, and the last 10% the testing dataset.

The reason for the training and testing sets should be fairly intuitive. Think about what you do in school: you learn new information from your teachers, the textbook, the internet, etc. Then, you are given a final exam to see how well you learned that new information. The questions on the exam are *similar* to what you learned from, but they are not the exact same. In the same way, we need to give the algorithm data to learn from, and separate but similar data to test itself on.

The reason for a validation set, however, might not be as clear. Validation sets and Testing sets are really similar; they are both datasets used as an evaluation tool for your algorithm. The key difference is that testing datasets are used *at the end* of the training process as the final check that your algorithm is working as you want it to, and validation is used to check incrementally during training. To continue our school analogy, if the training dataset is like the information in your textbook that you learn from, and the testing dataset is like the questions on your final exam, then the validation dataset is like the quizzes during the school year. Fundamentally, they are no different than the questions on the final exam, but they are *used* differently.

You might be thinking, "Why in the world does a machine learning model have to have quizzes and a final exam? This feels weird and redundant". That's a valid criticism, because it seems like a model should be able to get by with just a test dataset as its "final exam". We will take an in-depth look into the function of the validation dataset in a later module, but I want to at least give a quick sneak peek into the rationale here so you aren't left befuddled. The reason we have all three datasets has to do with two keywords you will see pop up time and time again: **overfitting** and **hyperparameter tuning**.

Overfitting basically means that your model has memorized the training dataset, but not the underlying patterns. It's like if you sat down and memorized a bunch of example problems from your math book, but never actually took the time to understand what was happening. What happens when it comes to quiz time? You fail because you never learned how to actually do the problem, *you just memorized the examples you were given*. The same thing can happen to a machine learning model, and giving it "quizzes" along the way can help us catch when the model isn't actually learning, but is just memorizing the training dataset. When it gets to the validation dataset, we can see it fail and make adjustments.
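Here is a minimal sketch of what catching overfitting with a "quiz" can look like. It uses synthetic data from scikit-learn's `make_classification` (an assumption for this illustration, not the Titanic set) and a decision tree with no depth limit, which is free to memorize its training data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y mislabels some points on purpose so that
# memorizing the training set does not pay off on new data.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# A tree with no depth limit can keep splitting until it fits the training data.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # accuracy on data the tree has seen
val_acc = tree.score(X_val, y_val)        # accuracy on data it has not seen
print("Training accuracy:  ", train_acc)
print("Validation accuracy:", val_acc)
```

A large gap between the two numbers is the warning sign: the model does well on examples it has memorized but noticeably worse on the quiz.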

How do we make adjustments? That's where hyperparameter tuning comes in. Hyperparameters are settings you choose before training that change the way a machine learning model behaves. However, it is possible to overtune the model to perform well on the validation dataset, which will again lead to overfitting. But this time, you won't be able to notice the overfitting, because you have made the model overfit to the validation dataset. This is where the test dataset comes into play; the test dataset makes sure *you yourself* didn't cause the model to overfit.

To make an analogy, imagine that your machine learning model is a physical machine -- perhaps a racecar. Hyperparameters are the things you can tweak and tune to make your model -- your racecar -- work better. You can tweak things like engine torque, wheel size, and chassis shape to try to make the car run better on a given track.

Let's say you are a racecar engineer. You design a car and you run simulations and such to make sure your car will work, based on data from previous cars built. This is like training your model on the training dataset. Then, you build a prototype racecar, bring it to the track, and test it. After testing, you realize you could change some things to make it run better -- analysis shows that increasing engine torque would make it run better on the track because the track is really hilly. So, you go back and tweak your design to have higher torque. This is like hyperparameter tuning. You do this process for a while, tweaking the torque until you are convinced you have the best racecar possible. You ship it to production and it goes out to the world. But soon, you get complaints from customers saying that your racecar is being outrun by most other racecars on all tracks that aren't hilly! Why did this happen? You overtuned your racecar design for specifically hilly tracks when it needed to perform on all types of tracks. So, when the racecar performs on anything other than its track (the track it was validated on), it performs poorly. If only we had tested on other racetracks before shipping to production!

This is, in essence, what can happen to machine learning models when you do hyperparameter tuning. If you tune your racecar -- your ML model -- to perform really well on one track (on a validation dataset), you run the risk of making it perform poorly on real racetracks (the data you use your model on in the real world). To safeguard against this, we use a testing dataset (other racetracks) to make sure that our model can perform on similar data that we didn't specifically tune to perform well on.
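The whole workflow (train, tune on validation, check once on test) can be sketched in a few lines. This is a rough illustration under stated assumptions: the data is synthetic, and the candidate `max_depth` values and `n_estimators=20` are arbitrary choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for this sketch, not the Titanic set).
X, y = make_classification(n_samples=600, n_features=8, random_state=1)

# 80% train, then split the remaining 20% evenly into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

best_depth, best_val_acc = None, -1.0
for depth in [1, 2, 4, 8]:  # candidate hyperparameter values (arbitrary choices)
    model = RandomForestClassifier(n_estimators=20, max_depth=depth, random_state=1)
    model.fit(X_train, y_train)          # learn from the training set only
    val_acc = model.score(X_val, y_val)  # "quiz": compare candidates on validation data
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

# "Final exam": refit the chosen setting and use the test set exactly once.
final_model = RandomForestClassifier(n_estimators=20, max_depth=best_depth, random_state=1)
final_model.fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)

print("Chosen max_depth:", best_depth)
print("Validation accuracy:", best_val_acc)
print("Test accuracy:", test_acc)
```

The test set never influences which `max_depth` gets picked, and that is what makes the final number an honest estimate.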

So, now that we have covered why you need to split raw data into three sets, let's actually do it.



In [19]:
from sklearn.model_selection import train_test_split
train_data, test_and_val_data = train_test_split(raw_data, test_size = .20, random_state = 2)  # split into train and an aggregated test and val
val_data, test_data = train_test_split(test_and_val_data, test_size = .50, random_state = 2)  # split aggregated into val and test

This code uses the library scikit-learn (which we talked about earlier) to split our raw data into train, test, and validation. Read about the train_test_split function [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

That link takes you to the published documentation for scikit-learn. All major libraries you use in machine learning (and anywhere else, really) will have documentation published on how their functions work, how you can use them, and (often) example code showing them in use. The documentation (sometimes called the API reference) is your primary resource in figuring out how to use any function included in Python libraries.

**OFTEN, LOOKING AT THE PUBLISHED DOCUMENTATION IS EASIER THAN JUST GOOGLING IT**
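To double-check a two-step split like the one above, you can count the rows in each piece. The sketch below uses a hypothetical 100-row table so the percentages are easy to read.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row table so the percentages are easy to read.
toy = pd.DataFrame({"feature": range(100), "label": [0, 1] * 50})

# Same two-step recipe as in the lesson: 80% train, then halve the remainder.
train, rest = train_test_split(toy, test_size=0.20, random_state=2)
val, test = train_test_split(rest, test_size=0.50, random_state=2)

print(len(train), len(val), len(test))  # 80 10 10

# Every row should land in exactly one of the three sets.
all_rows = list(train.index) + list(val.index) + list(test.index)
print(len(set(all_rows)))  # 100
```

Because `random_state` is fixed, rerunning the cell produces the same split every time, which makes results easier to compare while you experiment.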



## Let's get to work!

Now, let's take our data and get to classifying. For this exercise, we are going to be using an algorithm called random forests. It's not super critical that you have a deep understanding of how the algorithm works. If you want to learn more, [here](https://www.analyticsvidhya.com/blog/2021/10/an-introduction-to-random-forest-algorithm-for-beginners/) is a good resource on random forests.

First, we split the training data into input and output. "Y" holds our labels for who survived. "X" holds our input variables: passenger class and sex. Then, we make a random forest model with 10 trees in our forest and a max_depth of 10 on each tree.

In [20]:
from sklearn.ensemble import RandomForestClassifier
Y = train_data['Survived']
X = train_data[['Pclass', 'Sex']]
model = RandomForestClassifier(10, max_depth = 10)


Next, we "fit" the model. This is where the algorithm actually does the learning. It is called "fit" because we are "fitting" the algorithm onto our training data.

In [21]:
model.fit(X,Y)

Now, we can evaluate how well our model works on our validation data. If you followed all the above instructions, your model should be about 75% accurate.

In [22]:
val_Y = val_data['Survived']
val_X = val_data[['Pclass', 'Sex']]
predictions = model.predict(val_X)

In [23]:
# DO NOT CHANGE THIS CELL
accuracy = np.sum(val_Y == predictions)/len(val_X)*100
print("Accuracy: " , accuracy)

Accuracy:  75.28089887640449
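The accuracy formula in the cell above just counts how often the prediction matches the true label. As a small sketch with invented labels, here is the same calculation next to scikit-learn's `accuracy_score` helper, which returns a fraction rather than a percentage.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, invented for this example.
true_labels = np.array([1, 0, 1, 1])
predicted = np.array([1, 0, 0, 1])

# Manual version: count the matches, divide by the total, scale to percent.
manual = np.sum(true_labels == predicted) / len(true_labels) * 100

# scikit-learn's helper returns a fraction between 0 and 1.
helper = accuracy_score(true_labels, predicted) * 100

print(manual, helper)  # 75.0 75.0
```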


# Now, make some changes!

Congratulations! You have now trained your first machine learning model! Now, it's time to make it your own. I only specified the number of trees and the depth of the trees, but there are more hyperparameters you can tune. Look at the documentation, here at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. Make your own changes, experiment, and see if you can get better results.

Then, it's time for a real challenge: you should try to include more columns of data. If you remember, we dropped many columns earlier for the sake of simplicity. However, there is a lot of information embedded in that data, and a machine learning model is only as good as its data. More than likely, including more relevant data will lead to bigger improvements than hyperparameter tuning. Each type of data has its own issues. Some columns have missing data points (NaNs), which can cause the training to crash unless you either get rid of those data points or fill in a replacement for the missing value. Some are categorical, others aren't. Categorical data are either number labels (e.g., 1, 2, 3) or strings (words) that need to be converted to number labels before you can use them. Pick one column of data that you think would be relevant to determining which people survived the Titanic and which ones didn't, and then try to incorporate it into the model.

When you run into issues, ask yourself:

1) Are there any NaNs (missing values) in the data column you are using?

2) Are there any strings (words) left in your data?
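Both questions can be answered with a couple of pandas calls. The sketch below runs them on a hypothetical three-row table (the values are invented); on the real data you would call the same methods on `raw_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing value and one text column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0],
    "Embarked": ["S", "C", "S"],
    "Pclass": [3, 1, 2],
})

# Question 1: how many NaNs does each column contain?
nan_counts = toy.isna().sum()
print(nan_counts)

# Question 2: which columns still hold non-numeric values such as strings?
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```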

You may run into other issues as well. Google is your friend here; when you get an error message, be sure to google that message and try to figure out what it means. Then, once you understand what the error is telling you, try to find where the error is in your code.

When you're done, evaluate your model against the testing dataset to get your final accuracy.

To get you started, we will go ahead and copy all relevant code from above down so that you have it all in one place. **THERE IS NO NEED TO RUN ANY CODE FROM ABOVE THIS POINT DURING THIS EXERCISE**. Just make modifications to the code below.

---



In [55]:
import pandas as pd

# read the data file
raw_data = pd.read_csv('titanic.csv')
raw_data.head()

type(raw_data)

# drop columns we will not be using
raw_data.drop(["PassengerId", "Cabin", "Embarked",
               "Name", "SibSp", "Parch", "Ticket"], axis = 1, inplace = True)

raw_data.head()


# format the data
import numpy as np  # so we can use the "where" function
# Read this like: where raw_data['Sex'] equals 'male', use 0. Otherwise, keep the value in raw_data['Sex'].
raw_data['Sex'] = np.where(raw_data['Sex'] == 'male', 0, raw_data['Sex'])
# Same idea: where raw_data['Sex'] equals 'female', use 1.
raw_data['Sex'] = np.where(raw_data['Sex'] == 'female', 1, raw_data['Sex'])

raw_data.dropna(subset=['Age'], inplace=True)
# raw_data = raw_data.interpolate(method='linear', axis=0).ffill()


# split the data
from sklearn.model_selection import train_test_split
train_data, test_and_val_data = train_test_split(raw_data, test_size = .20, random_state = 2)  # split into train and an aggregated test and val
val_data, test_data = train_test_split(test_and_val_data, test_size = .50, random_state = 2)  # split aggregated into val and test

# split data into input and output
from sklearn.ensemble import RandomForestClassifier
Y = train_data['Survived']
X = train_data[['Sex']]
model = RandomForestClassifier(100, max_depth = 5)

# train the model
model.fit(X,Y)

# evaluate based on accuracy
test_Y = test_data['Survived']
test_X = test_data[['Sex']]
predictions_test = model.predict(test_X)
accuracy_test = np.sum(test_Y == predictions_test)/len(test_X)*100
print("Final Accuracy: " , accuracy_test)






Final Accuracy:  86.11111111111111


# Acknowledgements

Authors: Caden Hamrick, JoHanna Rodenbeck, Louisa Houser, Shane Tharani