# Connect Intensive - Machine Learning Nanodegree

## Week 9. Porto Seguro’s Safe Driver Prediction - A Kaggle Competition

### Objectives    

  - Get familiar with the process of creating machine learning solutions to a real problem  
  - Practice skills such as data analysis and visualization, model building, evaluation and optimization on a real-world dataset
  
### Prerequisites
  - [matplotlib](http://matplotlib.org/index.html)  
  - [numpy](http://www.scipy.org/scipylib/download.html)  
  - [pandas](http://pandas.pydata.org/getpandas.html)  
  - [sklearn](http://scikit-learn.org/stable/install.html)  
  - [seaborn](https://seaborn.pydata.org/installing.html)  


## Competition Overview

> **More details about the competition can be found [here](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction).**

Nothing ruins the thrill of buying a brand new car more quickly than seeing your new insurance bill. The sting’s even more painful when you know you’re a good driver. It doesn’t seem fair that you have to pay so much if you’ve been cautious on the road for years.

Porto Seguro, one of Brazil’s largest auto and homeowner insurance companies, completely agrees. Inaccuracies in car insurance companies’ claim predictions raise the cost of insurance for good drivers and reduce the price for bad ones.

In this competition, you’re challenged to build a model that **predicts the probability that a driver will initiate an auto insurance claim in the next year**, which makes this a *supervised binary classification* problem. While Porto Seguro has used machine learning for the past 20 years, they’re looking to Kaggle’s machine learning community to explore new, more powerful methods. A more accurate prediction will allow them to further tailor their prices, and hopefully make auto insurance coverage more accessible to more drivers.  

*Note:* The problem is well defined for you here, but in reality, you may be the one to frame the problem. 

## Our Steps

We can break the process into separate steps. You can use the links below to navigate the notebook.

[Step 0](#step0): Get the Data    
[Step 1](#step1): Data Exploration  
[Step 2](#step2): Prepare Data for Machine Learning Algorithms  
[Step 3](#step3): Select and Train a Model  
[Step 4](#step4): Fine Tune the Model  
[Step 5](#step5): Generate Submission File

---

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

%matplotlib inline

<a id='step0'></a>
## Step 0: Get the Data  

### Download the Data Files

The first step is to download the data. Go to the [competition page](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction) and navigate to the data tab, where you will find the **"download"** button. Note that you will need to log in to your Kaggle account and click the **"I understand and accept"** button. 

You can find descriptions of the files and the data on the data page. Below is the information taken from that page:  

> **Data Description:** In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., `ind`, `reg`, `car`, `calc`). In addition, feature names include the postfix **`bin`** to indicate binary features and **`cat`** to indicate categorical features. Features without these designations are either continuous or ordinal. Values of **-1** indicate that the feature was missing from the observation. The `target` column signifies whether or not a claim was filed for that policy holder.  

> **File Descriptions:** **`train.csv`** contains the training data, where each row corresponds to a policy holder, and the `target` column signifies that a claim was filed. **`test.csv`** contains the test data. **`sample_submission.csv`** is the submission file showing the correct format.

**TODO**   
Download the files following the instructions above. Unzip the downloaded files. Create a folder named **`data`** in the directory where this notebook is. Place the unzipped files in the **`data/`** folder. 

### Read the Data into Pandas 

Now you should have 3 csv files in the **`data/`** folder. Read them into pandas DataFrames and take a quick look at the data. 

**TODO** 
Use the following cells (feel free to add more if needed) to read the data and take a quick look at it.
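
If you are unsure which pandas calls give a "quick look", here is a minimal sketch. It reads a tiny, made-up CSV from a string so that it runs without the competition files; `csv_text` and `demo_df` are hypothetical names, and the values are invented for illustration. With the real data you would pass the file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Made-up stand-in for a few rows of a training file (not real competition data)
csv_text = """id,target,ps_ind_01,ps_ind_02_cat,ps_reg_03
7,0,2,2,0.718
9,0,1,1,0.766
13,1,5,4,-1.0
"""

demo_df = pd.read_csv(io.StringIO(csv_text))

print(demo_df.shape)       # (number of rows, number of columns)
print(demo_df.dtypes)      # data type of each column
print(demo_df.head())      # first few rows
print(demo_df.describe())  # summary statistics for the numeric columns
```

`demo_df.info()` is another option that combines dtypes, non-null counts, and memory usage in one call.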

In [None]:
# read './data/train.csv' 
train_df = ...

In [None]:
# find out the shape of the data, what are the features, what is the name of the target variable, 
# what are the data types, and get a statistical description of the data




> **Optional:** You will notice that the data types include *int64* and *float64*. One thing you can do to reduce the memory usage of the dataframe is to convert the data types into *int32* and *float32*. Can you write a function to perform this task?  
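
One possible way to approach the optional task is sketched below. The function name `downcast_dtypes` and the toy frame are hypothetical, and the sketch assumes every integer value fits in the 32-bit range (roughly plus or minus 2.1 billion); check this before applying it to an ID column.

```python
import numpy as np
import pandas as pd


def downcast_dtypes(df):
    """Return a copy of df with int64/float64 columns cast to int32/float32."""
    mapping = {}
    for col in df.select_dtypes(include=["int64"]).columns:
        mapping[col] = np.int32
    for col in df.select_dtypes(include=["float64"]).columns:
        mapping[col] = np.float32
    # astype with a dict only touches the listed columns and returns a new frame
    return df.astype(mapping)


# Toy data, only to show the effect on dtypes and memory
demo_df = pd.DataFrame({
    "count": np.array([1, 2, 3], dtype="int64"),
    "ratio": np.array([0.5, 1.5, 2.5], dtype="float64"),
})
small_df = downcast_dtypes(demo_df)

print(small_df.dtypes)
print(demo_df.memory_usage(deep=True).sum(), "->", small_df.memory_usage(deep=True).sum())
```

Keep in mind that *float32* carries about 7 significant digits, so the conversion trades a little precision for memory.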

In [None]:
# read './data/test.csv' 
test_df = ... 

In [None]:
# what does the test data look like? 




In [None]:
# read './data/sample_submission.csv' 
submission_df = ... 

In [None]:
# what should your submission format look like?  




### Create a Validation Set

In [None]:
# create a validation set (20%)  
from sklearn.model_selection import train_test_split
train, val = ...

**QUESTION**  
Why do we want to create a validation set now?    
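
As a hedged illustration of the split itself, the sketch below uses synthetic data with a rare positive class (the 4% rate is made up) and passes `stratify` so that both parts keep a similar class ratio. The names `demo_df`, `train_part`, and `val_part` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
demo_df = pd.DataFrame({
    "feature": rng.rand(1000),
    "target": (rng.rand(1000) < 0.04).astype(int),  # rare positive class, made up
})

# Hold out 20% of the rows; stratify keeps the share of positives similar in both parts
train_part, val_part = train_test_split(
    demo_df, test_size=0.2, stratify=demo_df["target"], random_state=42
)

print(len(train_part), len(val_part))
print(train_part["target"].mean(), val_part["target"].mean())
```

Fixing `random_state` makes the split reproducible, so later experiments are compared on the same validation rows.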

<a id='step1'></a>
## Step 1: Data Exploration  

Make sure you have put aside a validation set and you only explore the training set. Also, if the training set is very large, you may want to sample an exploration set to speed up the exploration. 

### Visualize Features  

**TODO**  
Use the code cells below (feel free to add more if needed) and create at least 3 visualizations using `matplotlib` or `seaborn`. What insights can you get from the visualizations?  

*Hint:* Look at the distributions of individual features, the correlations between features, or a correlation matrix. For a classification problem, you might also want to check the class distribution.  
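
To make the hint concrete, here is a small sketch of two of the suggested plots: the class distribution and a correlation matrix. It runs on synthetic data (`demo_df`, `feat_a`, and `feat_b` are hypothetical), so the pictures themselves carry no insight about the competition data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
demo_df = pd.DataFrame({
    "target": (rng.rand(500) < 0.05).astype(int),  # synthetic, imbalanced target
    "feat_a": rng.normal(size=500),
    "feat_b": rng.normal(size=500),
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Share of each class; a strong imbalance shows up immediately
class_share = demo_df["target"].value_counts(normalize=True).sort_index()
class_share.plot.bar(ax=axes[0], title="Class distribution")

# Pairwise Pearson correlations drawn as a heat map
corr = demo_df.corr()
im = axes[1].imshow(corr.values, cmap="coolwarm", vmin=-1, vmax=1)
axes[1].set_xticks(range(len(corr.columns)))
axes[1].set_xticklabels(corr.columns, rotation=45)
axes[1].set_yticks(range(len(corr.columns)))
axes[1].set_yticklabels(corr.columns)
axes[1].set_title("Correlation matrix")
fig.colorbar(im, ax=axes[1])
fig.tight_layout()
```

`seaborn.heatmap(corr, annot=True)` would draw the same matrix with less code if you prefer seaborn.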

In [None]:
# visualization 1: 



In [None]:
# visualization 2:  



In [None]:
# visualization 3: 



<a id='step2'></a>
## Step 2: Prepare Data for Machine Learning Algorithms   

Now, it is time to prepare the data for machine learning algorithms. 

### Data Cleaning  

**TODO:** 
In this dataset, missing values are encoded as `-1`. Find out the percentage of missing values for each feature. What strategies would you use to deal with the missing values? Implement your strategies.   
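
A minimal sketch of one possible strategy, on a made-up frame: compute the share of `-1` per column, impute the median for numeric columns, and keep `-1` as its own category for `_cat` columns. Both the column selection rule and the choice of median are assumptions for illustration, not the recommended answer.

```python
import numpy as np
import pandas as pd

# Made-up values; the column names only follow the naming pattern from the data description
demo_df = pd.DataFrame({
    "ps_car_03_cat": [-1, 1, -1, 0],
    "ps_reg_03": [0.7, -1.0, 0.9, 1.1],
    "ps_ind_01": [2, 1, 5, 0],
})

# Percentage of missing values (-1) per feature
missing_pct = (demo_df == -1).mean() * 100
print(missing_pct.sort_values(ascending=False))

# Numeric columns: turn -1 into NaN, then fill with the column median.
# Categorical/binary columns are left alone, so -1 acts as a separate "missing" category.
num_cols = [c for c in demo_df.columns if not c.endswith(("_cat", "_bin"))]
cleaned = demo_df.copy()
cleaned[num_cols] = cleaned[num_cols].replace(-1, np.nan)
cleaned[num_cols] = cleaned[num_cols].fillna(cleaned[num_cols].median())
print(cleaned)
```

When you apply this to the real data, compute the medians on the training set only and reuse them for the validation and test sets.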

### Categorical Variables  

**TODO:**  
Convert categorical values into dummy variables using either `get_dummies()` from pandas or `OneHotEncoder()` from scikit-learn. What happens if some values of a categorical feature are present only in the training set or only in the test set? Implement your strategies. 
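
The mismatch problem can be sketched with two tiny, made-up frames: the "test" frame contains a category that never appears in "training". One possible fix, shown here, is to reindex the test dummies to the training columns. The frame names are hypothetical.

```python
import pandas as pd

train_demo = pd.DataFrame({"ps_car_04_cat": [0, 1, 2, 1]})
test_demo = pd.DataFrame({"ps_car_04_cat": [0, 3, 1]})  # category 3 is unseen in training

train_dummies = pd.get_dummies(train_demo, columns=["ps_car_04_cat"], dtype=int)
test_dummies = pd.get_dummies(test_demo, columns=["ps_car_04_cat"], dtype=int)

print(list(train_dummies.columns))
print(list(test_dummies.columns))  # differs from the training columns

# Force the test frame onto the training columns:
# unseen categories are dropped, categories missing from test become all-zero columns
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

With scikit-learn, `OneHotEncoder(handle_unknown="ignore")` fitted on the training set gives a similar result: unseen categories are encoded as all zeros.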

### Feature Scaling   

Feature scaling is one of the most important transformations you need to apply because, with few exceptions, machine learning algorithms don't perform well when the input numerical features have very different scales. Some examples of algorithms where feature scaling matters are:

- k-nearest neighbors with a Euclidean distance measure, if you want all features to contribute equally  
- k-means (for reasons similar to k-nearest neighbors) 
- logistic regression, SVMs, perceptrons, neural networks, etc., if you are using gradient descent/ascent-based optimization, because otherwise some weights will be updated much faster than others   
- linear discriminant analysis, principal component analysis, and kernel principal component analysis, since you want to find the directions that maximize the variance; with features on different scales, you would emphasize variables on "larger measurement scales" more.   

**TODO:**  
Perform feature scaling on the dataset. Should we use standardization or min-max scaling? Why? 
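
To compare the two options side by side, here is a small sketch on made-up arrays. In both cases the scaler is fitted on the training data only and then applied to the validation data, so no information leaks from the held-out set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (made-up numbers)
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_val = np.array([[2.0, 400.0]])

# Standardization: subtract the training mean, divide by the training standard deviation
std_scaler = StandardScaler().fit(X_train)
X_train_std = std_scaler.transform(X_train)
X_val_std = std_scaler.transform(X_val)

# Min-max scaling: map the training minimum to 0 and the training maximum to 1
minmax_scaler = MinMaxScaler().fit(X_train)
X_train_mm = minmax_scaler.transform(X_train)
X_val_mm = minmax_scaler.transform(X_val)

print(X_train_std)
print(X_val_mm)  # values outside the training range fall outside [0, 1]
```

Note how the validation value 400 lands above 1 after min-max scaling; min-max scaling is also more sensitive to outliers than standardization.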

### Other Feature Transformation  

**QUESTION:**  
Do you think there are other feature transformations necessary, e.g., PCA?  
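
If you want to experiment with PCA, the sketch below shows the mechanics on synthetic data in which half of the features are near-copies of the other half. The 95% variance threshold is an arbitrary choice for illustration; whether PCA helps on the competition data is something to test, not assume.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
base = rng.normal(size=(200, 3))
# Six features: three independent ones plus three noisy near-copies of them
X = np.hstack([base, base + 0.01 * rng.normal(size=(200, 3))])

# PCA is scale-sensitive, so standardize first
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps just enough components to explain that share of variance
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_)
```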

<a id='step3'></a>
## Step 3: Select and Train a Model  


### Train and Evaluate One Model 

**TODO:**  
Select and train one model. Why did you select this model? What is the training score and what is the validation score? Is this the final model you would choose for fine tuning?     
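
Below is a minimal sketch of training and scoring a single baseline model on synthetic, imbalanced data. Logistic regression is just a convenient starting point here, not a recommendation. The helper `normalized_gini` is hypothetical; it relies on the identity Gini = 2 * AUC - 1 for binary targets, and assumes the normalized Gini coefficient is the competition metric, which you should confirm on the competition's evaluation page.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared training data (about 10% positives)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)


def normalized_gini(y_true, y_prob):
    """Normalized Gini for a binary target, computed from ROC AUC."""
    return 2 * roc_auc_score(y_true, y_prob) - 1


# Score predicted probabilities of the positive class, not hard labels
train_gini = normalized_gini(y_tr, model.predict_proba(X_tr)[:, 1])
val_gini = normalized_gini(y_val, model.predict_proba(X_val)[:, 1])
print(f"train Gini: {train_gini:.3f}, validation Gini: {val_gini:.3f}")
```

A large gap between the training and validation scores is a sign of overfitting; accuracy would be a misleading score here because of the class imbalance.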

### Cross-Validation

**TODO:**  
Select 3-5 different algorithms, and use cross-validation to compare their performance and running time. Which model would you choose for further fine tuning?  
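
One way to organize the comparison is sketched here with three candidate models on synthetic data. The candidate list, the `roc_auc` scoring, and the 5-fold setup are all assumptions for illustration; swap in the algorithms and metric you actually want to compare.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# The same stratified folds are used for every model, so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, estimator in candidates.items():
    start = time.perf_counter()
    scores = cross_val_score(estimator, X, y, cv=cv, scoring="roc_auc")
    elapsed = time.perf_counter() - start
    results[name] = (scores.mean(), scores.std(), elapsed)

for name, (mean_auc, std_auc, seconds) in results.items():
    print(f"{name}: AUC {mean_auc:.3f} +/- {std_auc:.3f} ({seconds:.2f}s)")
```

Looking at the standard deviation across folds, not only the mean, helps you judge whether a difference between two models is meaningful.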

<a id='step4'></a>
## Step 4: Fine Tune the Model 

### Grid Search  

**TODO:**
Optimize the model chosen in the previous step using grid search. Do you think randomized search would be better in this case? What are the training and validation scores after model optimization?   
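
To see how the two search strategies differ in code, here is a hedged sketch that tunes a single hyperparameter (`C` of a logistic regression) on synthetic data. The parameter ranges, `n_iter=8`, and `cv=3` are arbitrary illustration values.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Grid search: tries every value in a fixed list
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)

# Randomized search: samples n_iter values from a distribution
rand = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=8,
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
rand.fit(X, y)

print("grid:  ", grid.best_params_, round(grid.best_score_, 3))
print("random:", rand.best_params_, round(rand.best_score_, 3))
```

The cost of grid search grows multiplicatively with each added hyperparameter, while randomized search lets you fix the budget with `n_iter`; that trade-off is worth considering when you answer the question above.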

<a id='step5'></a>
## Step 5: Generate Submission File  

**TODO:**  
Now you can make predictions with the above model on the instances in **`test_df`** and generate a submission file. You can submit your file to Kaggle. What is your public leaderboard score and ranking?  
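
The mechanics of building the file can be sketched as follows. The column names `id` and `target` are an assumption about the expected format, so check them against **`sample_submission.csv`**; the IDs and probabilities below are made up, and in practice the probabilities would come from your model's `predict_proba` on the prepared test features.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the test IDs and the predicted claim probabilities
test_ids = np.array([0, 1, 2])
predicted_prob = np.array([0.03, 0.12, 0.05])

submission = pd.DataFrame({"id": test_ids, "target": predicted_prob})

# Without a path, to_csv returns the CSV text; pass a path such as
# "submission.csv" to write the file instead
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing `index=False` matters: otherwise pandas writes an extra unnamed index column that is not part of the submission format.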