```
Created: 2019-09-20
Author: Roy Wilds

Updates
2019-11-17: Cleaned up for push to github
```

# About this notebook
This notebook captures the typical steps involved in building a regression model using pandas and sklearn.

# Data Loading
This uses the laptopbatterylife CSV file from the HackerRank challenge: https://www.hackerrank.com/challenges/battery/problem

It's a very simple dataset, but it was handy for figuring out exactly what the various sklearn methods expect as input: a DataFrame, a Series, or an array.

Two models were investigated:
- Good old LinearRegression (spoiler: it did badly, since the relationship in this data isn't linear)
- Support Vector Machine Regression (SVR)

In [None]:
import pandas as pd

In [None]:
csvfile = '~/data/laptopbatterylife.txt'
df = pd.read_csv(csvfile, header=None, names=['charge_duration','batterylife_duration'])
df.count()
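
For anyone without the CSV at hand, here is a minimal sketch of what `header=None` plus `names=` does on a headerless two-column file. The three rows are made-up stand-ins, not values from the real dataset.

```python
import io

import pandas as pd

# Made-up rows standing in for the headerless file (illustration only).
sample = io.StringIO("2.81,5.62\n7.14,8.00\n2.72,5.44\n")

# header=None: the first row is data, not column names.
# names=[...]: supply the column names ourselves.
demo = pd.read_csv(sample, header=None,
                   names=['charge_duration', 'batterylife_duration'])

print(demo.shape)    # (rows, columns)
print(demo.count())  # non-null count per column
```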

In [None]:
df.sample(5)

# Data Manipulation

In [None]:
df.dtypes

No data manipulation required. Columns loaded as floats as desired.
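
Had a column come through as `object` instead (a stray non-numeric token will do that), one possible fix is `pd.to_numeric` with `errors='coerce'`. This is a sketch on made-up values, not something this dataset needed.

```python
import pandas as pd

# Made-up frame where everything loaded as strings and one value is junk.
raw = pd.DataFrame({
    'charge_duration': ['2.81', '7.14', 'n/a'],
    'batterylife_duration': ['5.62', '8.00', '8.00'],
})

# Convert each column to numbers; anything unparseable becomes NaN.
clean = raw.apply(pd.to_numeric, errors='coerce')
print(clean.dtypes)

# Drop rows that could not be parsed.
clean = clean.dropna()
print(len(clean))
```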

# Data Exploration
Always good to understand the raw data before jumping into modeling.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.scatter(df['charge_duration'], df['batterylife_duration'], color='black')

plt.xticks(())
plt.yticks(())

plt.show()

Based on the above, I'm not sure a linear regression model will be much good. But, we'll start simple.

# Modeling
Going to build a model to predict the `batterylife_duration` from the `charge_duration`.

The train/test split is done here, since it's used by all the supervised modeling subsections below.

In [None]:
from sklearn.model_selection import train_test_split 

In [None]:
# sklearn estimators expect 2D features, so use double brackets to keep a DataFrame... labels can be a 1D Series though.
data = df[['charge_duration']]
labels = df['batterylife_duration']

# Make train/test sets with a 30% test size.
data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.3)
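
To make the DataFrame vs. Series distinction concrete, here is a small sketch on a made-up ten-row frame (the column names match the notebook, the values do not). It also passes `random_state`, which the cell above does not, so that the split is repeatable between runs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data, same column names as the notebook.
toy = pd.DataFrame({
    'charge_duration': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    'batterylife_duration': [2.0, 4.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0],
})

features = toy[['charge_duration']]   # double brackets -> 2D DataFrame
target = toy['batterylife_duration']  # single brackets -> 1D Series
print(type(features).__name__, features.shape)
print(type(target).__name__, target.shape)

# Same random_state -> same rows end up in train and test every time.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=0)
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    features, target, test_size=0.3, random_state=0)
print(list(X_test.index) == list(X_test2.index))
```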

In [None]:
data_train.head(5)

In [None]:
labels_train.head(5)

## Linear Regression Model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
linreg = LinearRegression()

In [None]:
linreg.fit(data_train, labels_train)

In [None]:
pred_test = linreg.predict(data_test)

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# The coefficients
print('Coefficients: \n', linreg.coef_)
# The mean squared error
print("Mean squared prediction error: %.2f"
      % mean_squared_error(labels_test, pred_test))
# R^2 (coefficient of determination): 1 is perfect prediction
print('R^2 score: %.2f' % r2_score(labels_test, pred_test))
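
For reference, both metrics are easy to compute by hand. `r2_score` is the coefficient of determination, 1 - SS_res / SS_tot. The sketch below uses made-up numbers and checks the manual values against sklearn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# MSE: mean of the squared residuals.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```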

In [None]:
# Plot outputs
plt.scatter(data_test, labels_test,  color='black')
plt.plot(data_test, pred_test, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

As anticipated from the scatter plot, not great, but it's something.

## Support Vector Regression
This doesn't scale well to big datasets, but will be fine for 100 data points.

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR

In [None]:
from sklearn.svm import SVR

In [None]:
svrmodel = SVR(gamma='auto')

In [None]:
svrmodel.fit(data_train, labels_train)

In [None]:
# svrmodel.predict() returns a numpy array (as all sklearn predict() methods do).
# Wrap it in a Series that reuses the labels_test index, so the two line up row for row.
pred_test = pd.Series(svrmodel.predict(data_test), index=labels_test.index)
pred_test[0:4]

In [None]:
labels_test[0:4]
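
One thing to watch when eyeballing predictions next to labels: a Series built from a bare numpy array gets a fresh 0..n-1 index, while the test labels keep their shuffled original row numbers. sklearn's metrics compare by position so they're unaffected, but pandas arithmetic aligns on index. A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up labels with a shuffled index, like the output of train_test_split.
labels = pd.Series([10.0, 20.0, 30.0, 40.0], index=[7, 2, 9, 4])
preds = np.array([11.0, 19.0, 31.0, 39.0])

naive = pd.Series(preds)                        # index is 0, 1, 2, 3
aligned = pd.Series(preds, index=labels.index)  # index is 7, 2, 9, 4

# Subtraction matches rows by index label, not by position.
diff_naive = labels - naive      # mostly NaN: the indexes barely overlap
diff_aligned = labels - aligned  # one residual per test row

print(diff_naive)
print(diff_aligned)
```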

In [None]:
# The params
print('Params: \n', svrmodel.get_params())
# The mean squared error
print("Mean squared prediction error: %.2f"
      % mean_squared_error(labels_test, pred_test))
# R^2 (coefficient of determination): 1 is perfect prediction
print('R^2 score: %.2f' % r2_score(labels_test, pred_test))

In [None]:
# Plot outputs
plt.scatter(data_test, labels_test,  color='black')
plt.scatter(data_test, pred_test, color='blue', marker='x')

plt.xticks(())
plt.yticks(())

plt.show()

Much better than the Linear Regression results, both visually and in terms of MSE.
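
To put the two models side by side on one split, something like the following could work. It's a sketch on synthetic data: the saturating shape (`min(2x, 8)` plus a little noise), the seed, and the `results` dict are my assumptions for illustration, not the HackerRank file or the numbers from the runs above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data with a non-linear (saturating) relationship.
rng = np.random.default_rng(0)
charge = rng.uniform(0.0, 12.0, size=100)
life = np.minimum(2.0 * charge, 8.0) + rng.normal(0.0, 0.1, size=100)
toy = pd.DataFrame({'charge_duration': charge, 'batterylife_duration': life})

X_train, X_test, y_train, y_test = train_test_split(
    toy[['charge_duration']], toy['batterylife_duration'],
    test_size=0.3, random_state=0)

# Fit both models on the same split and collect the same metrics.
results = {}
for name, model in [('LinearRegression', LinearRegression()),
                    ('SVR', SVR(gamma='auto'))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {'mse': mean_squared_error(y_test, pred),
                     'r2': r2_score(y_test, pred)}

for name, scores in results.items():
    print(f"{name}: MSE={scores['mse']:.2f}, R^2={scores['r2']:.2f}")
```

Using the same split for both models keeps the comparison fair; with a fresh random split per model, some of the difference would just be sampling noise.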