# HOTH-6 Intro to ML Workshop

Follow along at [https://github.com/uclaacmai/hoth-6-workshop](https://github.com/uclaacmai/hoth-6-workshop)!

In this workshop, we'll walk through the basic steps of a typical machine learning workflow so that you can apply ML in your hack!

![](https://cdn-images-1.medium.com/max/2000/1*KzmIUYPmxgEHhXX7SlbP4w.jpeg)

## Sample Workflow

Let's think about what we need to do when approaching any machine learning competition or problem. As we discussed in the slides, any problem that involves a prediction task is a candidate for machine learning.

1) Determine your problem space. Do you have a classification problem, or a regression problem? 

2) Determine what model you want to use (Always good to start off with simple models).

3) Load in and preprocess your dataset. Examine your dataset to see if there are any NULL or non-numeric values.

4) Split up your dataset into training and testing components. 

5) Create your model. Depending on the libraries you are using, this could entail defining your function, your placeholders, the loss function, and the optimizer. 

6) Train, evaluate, and iterate on your model!

7) Once you have a model that you've trained and that you're satisfied with, you can deploy it to a server! [Flask](http://flask.pocoo.org/) is often a good choice for serving ML models. 
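To make steps 3 through 6 concrete before we touch the Titanic data, here is a minimal sketch of the same loop. It assumes scikit-learn is installed and uses a synthetic dataset from `make_classification` as a stand-in; the sizes and `random_state` values are arbitrary choices for illustration.

```python
# A compact pass through steps 3-6 of the workflow on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 3: load data (here: 200 synthetic rows, 4 numeric features, binary label).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Step 4: hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 5: create a simple model.
model = LogisticRegression()

# Step 6: train, then evaluate on the held-out rows.
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print("Test accuracy:", accuracy)
```

The rest of the workshop follows this same shape, just with real data that needs cleaning first.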

## Using ML in the Titanic Competition

To show everybody what this workflow looks like in practice, we'll do the classic Titanic Prediction problem that is hosted on [Kaggle](https://www.kaggle.com/c/titanic). The goal is to be able to predict who survived and who passed away during the Titanic tragedy, given information about the people involved.

### Getting the Data

You can download the data from the Kaggle website. The direct link is [here](https://www.kaggle.com/c/titanic/data).

In [0]:
import pandas as pd

In [0]:
# Use the Pandas read_csv() function to load in the train.csv
titanicTrain = pd.read_csv('https://raw.githubusercontent.com/uclaacmai/hoth-6-workshop/master/train.csv')

### Examining Data

In [0]:
# Figure out what the different column names are
titanicTrain.columns.tolist()
titanicTrain.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


### Cleaning Data

This is one of the most important parts of any machine learning pipeline. We want to make sure that the inputs we feed into any machine learning model are valid, non-null, and numerical. To get you started with data preprocessing, we'll show you one example of a column you may want to drop in this dataset.

In [0]:
# Summarize the columns, their dtypes, and how many non-null values each has
titanicTrain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


As you can see above, some passengers are missing values for the Age and Cabin attributes (and two are missing Embarked). There are several ways to deal with this (for example, replacing the null values with the median of the other values, or with 0), but when most of a column is missing, as with Cabin (only 204 of 891 values present), a simple method is to just drop the column.

In [0]:
# Drop the column
titanicTrain.drop(['Cabin'], axis = 1, inplace = True)
# Alternatively, if you don't want to modify the original DataFrame, assign the result of .drop() to a new variable.
# titanicTrain_dropped = titanicTrain.drop(['Cabin'], axis = 1) # axis: 0 for rows, 1 for columns


Another column that needs processing is Age. Only 714 of the 891 passengers have an age recorded, so we fill the gaps with the median age.

In [0]:
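# Fill in the missing ages with the median age
medianAge = titanicTrain['Age'].median()
# Assign the result back: fillna(..., inplace = True) on a selected column may not update the DataFrame in newer pandas versions
titanicTrain['Age'] = titanicTrain['Age'].fillna(medianAge)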

Now, try it on your own! The functions you will probably be using are (although you're not limited to just these!):
- [drop()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)
- [fillna()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html)
- [get_dummies()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html)
- [dropna()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html)
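If you haven't used these functions before, here is a small sketch of what `dropna()`, `fillna()`, and `get_dummies()` do. The `toy` DataFrame and its values are made up for illustration; they only borrow two column names from the Titanic data.

```python
import pandas as pd

# Toy frame (made-up values) with one missing entry in each column.
toy = pd.DataFrame({
    "Embarked": ["S", "C", None, "Q"],
    "Fare": [7.25, 71.28, 8.05, None],
})

# dropna(): discard every row that contains at least one null value.
complete_rows = toy.dropna()

# fillna(): replace the nulls in one column with that column's median.
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].median())

# get_dummies(): turn a string column into one indicator column per category.
encoded = pd.get_dummies(toy, columns=["Embarked"])

print(complete_rows)
print(encoded)
```

Note the trade-off: `dropna()` throws away whole rows, while `fillna()` keeps every row but invents a value, so which one is appropriate depends on how much data is missing.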

In [0]:
titanicTrain.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S |


In [0]:
# TODO Find the other attributes that may give us trouble later on! Once you find these
# columns, figure out if you just want to drop the attribute altogether or replace with 
# median, or something else!

# HINT: The name attribute is something you may want to look at. We don't want strings in our ML model!

Now that you know a couple of ways of dealing with null values and string values, feel free to be creative! Understanding how to visualize and clean your data is often the most effective way to get a more accurate machine learning model. This is one of the most important steps in any ML pipeline.
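If you're not sure which columns will cause trouble, one possible way to find them programmatically is sketched below. The `df` DataFrame here is a made-up stand-in; on the real data you would run the same calls on `titanicTrain`.

```python
import pandas as pd

# Toy frame (made-up values) mixing numeric, string, and missing entries.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Name": ["Braund", "Cumings", "Heikkinen"],
    "Fare": [7.25, 71.28, 7.93],
})

# Count missing values per column and keep only the columns that have any.
null_counts = df.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0].index.tolist()

# List the columns whose dtype is not numeric (string columns show up here).
non_numeric_columns = df.select_dtypes(exclude="number").columns.tolist()

print("Columns with nulls:", columns_with_nulls)
print("Non-numeric columns:", non_numeric_columns)
```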

### Creating Testing and Training Matrices

So, now that we've made our final changes to our dataframe, we want to convert it into a matrix of numbers. We want our Y Matrix to be filled with binary labels indicating whether the person survived or not. Our X Matrix should contain all of the features that represent each individual.  

In [0]:
titanicTrain.info()
titanicTrain.head()
# Encode the Sex column as numbers: female -> 1, male -> 0
mapping = {
    'female': 1,
    'male': 0
}
titanicTrain['Sex'] = titanicTrain['Sex'].map(mapping)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 76.6+ KB


In [0]:
# Convert to matrices. 
# TODO Add/Remove columns as you see fit
print(titanicTrain.columns)
# .to_numpy() replaces .as_matrix(), which was removed in pandas 1.0
X = titanicTrain[['Pclass', 'Age', 'SibSp', 'Sex', 'Parch', 'Fare']].to_numpy()
Y = titanicTrain['Survived'].to_numpy()
Y = Y.reshape([Y.shape[0], 1]) # Reshaping from (891,) to (891,1)
print(X.shape)
print(Y.shape)
import numpy as np

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Embarked'],
      dtype='object')
(891, 6)
(891, 1)


Remember that whenever we have a dataset, it's good practice to separate it into two parts: one that we use to train the model, and one that we hold back as a test/validation set to check how our model is doing.

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(668, 6)
(223, 6)
(668, 1)
(223, 1)
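By default `train_test_split` holds out 25% of the rows, which is why 223 of the 891 passengers ended up in the test set. The sketch below shows a few optional arguments you may find useful; the demo arrays and the specific values (`test_size=0.25`, `random_state=42`) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, 30% positive labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 30 + [0] * 70)

# test_size sets the held-out fraction, random_state makes the shuffle
# repeatable, and stratify keeps the class ratio roughly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("Positive rate (train, test):", y_tr.mean(), y_te.mean())
```

Fixing `random_state` is handy during a hackathon because it lets you compare models on exactly the same split.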


### Create Model



In [0]:
from sklearn import linear_model

# Some other models you can import:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

In [0]:
model = linear_model.LogisticRegression()

# Uncomment one of the lines below if you want to try some other classifiers
# model = RandomForestClassifier()
# model = AdaBoostClassifier()
# model = GradientBoostingClassifier()
# model = KNeighborsClassifier()

# ravel() flattens the (n, 1) label array to (n,), the shape scikit-learn expects for y
model.fit(X_train, y_train.ravel())


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
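Because scikit-learn classifiers share the same `fit()` / `predict()` / `score()` interface, swapping models is mostly a one-line change. Here is a rough sketch that tries a few of them in a loop; it uses synthetic data as a stand-in for the Titanic matrices, and the `candidates` dictionary and its settings are just illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Titanic matrices (6 numeric features).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
    "KNeighbors": KNeighborsClassifier(),
}

# Every classifier exposes fit() and score(), so the loop body
# does not change from model to model.
scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)  # mean accuracy on the test split

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```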

In [0]:
from sklearn.metrics import accuracy_score, classification_report

predictions = model.predict(X_test)
print("The accuracy is:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))

The accuracy is: 0.7982062780269058
              precision    recall  f1-score   support

           0       0.86      0.85      0.85       155
           1       0.67      0.68      0.67        68

   micro avg       0.80      0.80      0.80       223
   macro avg       0.76      0.76      0.76       223
weighted avg       0.80      0.80      0.80       223



### How This Can Help With Your Hack!

Today we saw what a typical machine learning pipeline looks like. Machine learning models can be incredibly powerful, provided you have an appropriate problem space and relevant data and labels. To bring ML into your hack, think about whether there is a prediction problem you would like to automate (determining the sentiment of a text, whether something is a hot dog or not, whether you should recommend something to someone), and then determine whether relevant data and labels exist for that problem.

Once you've trained a model, you can look into deploying it. Flask is a Python web framework that can be used for this. You can take a look at the following links for more information:

- http://flask.pocoo.org/
- https://towardsdatascience.com/a-flask-api-for-serving-scikit-learn-models-c8bcdaa41daa 
- https://hackernoon.com/deploy-a-machine-learning-model-using-flask-da580f84e60c
- https://medium.com/coinmonks/deploy-your-first-deep-learning-neural-network-model-using-flask-keras-tensorflow-in-python-f4bb7309fc49
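To give a feel for what serving a model with Flask could look like, here is a minimal sketch. It assumes Flask and scikit-learn are installed; the `/predict` route, the `features` JSON key, and the stand-in model trained on synthetic data are all hypothetical choices for illustration, not part of the workshop repo.

```python
from flask import Flask, jsonify, request
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on synthetic data; in a real hack you would
# load your own trained model here instead.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

app = Flask(__name__)

# Hypothetical endpoint: POST {"features": [f1, ..., f6]} -> {"prediction": 0 or 1}
@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    features = [payload["features"]]  # predict() expects a 2D array of rows
    prediction = int(model.predict(features)[0])  # numpy int -> plain int for JSON
    return jsonify({"prediction": prediction})

# To serve locally, one option is to call app.run(port=5000) at the end of the file.
```

A client (your front end, for example) would then send a POST request with a JSON body containing six feature values and read the prediction from the JSON response.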

Come up to us afterwards if you have any questions!