There are already a bunch of awesome Scripts, but I wanted to step back and work with some more rudimentary models to make sure I was doing the right data preparation.

Let's start by loading our packages and data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
from scipy.stats import skew, boxcox
import statsmodels.formula.api as smf

# Load Training Data
train = pd.read_csv('../input/train.csv', dtype={'id': np.int32})

# Load Test Data
test = pd.read_csv('../input/test.csv', dtype={'id': np.int32})

Nomenclature note: The outcome variable for this competition is 'loss'. (If you read much machine learning literature, you've probably heard the term loss as in '[loss function](https://en.wikipedia.org/wiki/Loss_function)'.) That isn't what we mean in this context. The 'loss' variable in this case literally refers to the amount Allstate lost on the settlement. Wherever you see 'loss' in this document, assume I'm talking about the amount Allstate lost, and not the output of a loss function.

Now, linear models tend to predict better when the outcome is roughly normally distributed. Let's check whether 'loss' is:

In [None]:
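# density=True replaces the `normed` argument, which newer Matplotlib releases no longer accept
plt.hist(train['loss'], 30, density=True)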
plt.xlabel('Loss')
plt.ylabel('Probability')
plt.title('Distribution of Losses')
plt.show()

Wow. That isn't normally distributed at all: it's super *[skewed](https://en.wikipedia.org/wiki/Skewness)*.

In [None]:
skew(train['loss'])
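
For a sense of scale, here is a small sketch on synthetic samples (not the competition data; the variable names are made up for illustration). A symmetric sample should have a skew near zero, while a long right tail pushes it well above one.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# A symmetric sample and a sample with a long right tail
symmetric = rng.normal(loc=0.0, scale=1.0, size=100_000)
right_tailed = rng.exponential(scale=1.0, size=100_000)

skew_symmetric = skew(symmetric)        # should land close to 0
skew_right_tailed = skew(right_tailed)  # should land near 2, the theoretical skew of an exponential

print(skew_symmetric, skew_right_tailed)
```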

Any skew greater than one should probably catch your attention. Luckily, we have a simple counterspell! Let's *log-transform* the 'loss' variable.

In [None]:
train['log_loss'] = np.log(train['loss'])

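# density=True replaces the `normed` argument, which newer Matplotlib releases no longer accept
plt.hist(train['log_loss'], 30, density=True)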
plt.xlabel('Log(Loss)')
plt.ylabel('Probability')
plt.title('Distribution of Log(Loss)es')
plt.show()

Much better. Now, what about our input variables? Are they similarly skewed?

In [None]:
features_numeric = test.dtypes[test.dtypes != "object"].index
features_skewed = train[features_numeric].apply(lambda x: skew(x.dropna()))
features_skewed

Some of them, yeah. We could fix that by log-transforming them as well, but log is a blunt instrument. It's easily reversible, which makes it a good fit for the outcome, but the Box-Cox transform, which tunes a power parameter to the data, is a better tool for modifying our inputs. Let's apply it to any feature with a skew greater than, say, 0.2.

In [None]:
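# Keep only the features whose skew exceeds the 0.2 threshold
features_skewed = features_skewed[features_skewed > 0.2]
for feat in features_skewed.index:
    # boxcox needs strictly positive input, hence the + 1 shift
    train[feat], lam = boxcox(train[feat] + 1)
    # Reuse the lambda fitted on train so both sets get the same transform
    test[feat] = boxcox(test[feat] + 1, lam)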

features_skewed = train[features_numeric].apply(lambda x: skew(x.dropna()))
features_skewed

That eliminated much of the skewness. Before we move on, however, I'd like to call attention to the way we handle `lam` in the above block. We let `boxcox` figure out the optimal `lam` using our training data, and then force it to use that same `lam` on the test data, even if it isn't necessarily optimal for the test data. The alternative approach is to bind `train` and `test` together, perform these transformations on the entire set, and then split them back apart when it comes time to build models. I've opted not to for the benefit of clarity, but possibly at the cost of some small modeling advantage.
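
To make the `lam` handling concrete, here is a minimal sketch on synthetic log-normal data (the `toy_train` and `toy_test` arrays are invented for illustration, not taken from the competition). It fits lambda on one sample, reuses it on the other, and checks the result against the Box-Cox formula written out by hand.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Two right-skewed toy samples standing in for a train and a test column
toy_train = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)
toy_test = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)

# Without lmbda, boxcox estimates lambda and returns (transformed, lambda)
train_bc, lam = boxcox(toy_train + 1)

# With lmbda supplied, boxcox returns only the transformed array
test_bc = boxcox(toy_test + 1, lmbda=lam)

# The same transform by hand for a nonzero lambda: (x**lam - 1) / lam
manual = ((toy_test + 1) ** lam - 1) / lam

skew_before = skew(toy_train + 1)
skew_after = skew(train_bc)
print(lam, skew_before, skew_after)
```

Because `lam` comes from the training sample alone, nothing about the test sample leaks into the transform.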

Now, we have some categorical features we need to handle. The textbook approach to Linear Regression says you can leave categorical variables in, provided you do something like *[one-hot encode](https://en.wikipedia.org/wiki/One-hot)* them and leave one category out as the baseline. Personally, I prefer to replace each category with the arithmetic mean of the outcome (here, `log_loss`) over the training rows that fall in that category.

In [None]:
features_categorical = [feat for feat in test.columns if 'cat' in feat]

for feat in features_categorical:
    # Mean log_loss per category level, computed on the training data only
    means = train.groupby(feat)['log_loss'].mean()
    # Levels that never occur in train map to NaN in test
    train[feat] = train[feat].map(means)
    test[feat] = test[feat].map(means)

features_categorical = test.dtypes[test.dtypes == "object"].index
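
A toy illustration of this mean encoding may help. The frames and the `cat_demo` column below are hypothetical, and the lookup uses `Series.map`. Note what happens to the level 'C', which appears only in the toy test set.

```python
import pandas as pd

# Hypothetical miniature train and test sets
toy_train = pd.DataFrame({
    'cat_demo': ['A', 'A', 'B', 'B', 'B'],
    'log_loss': [1.0, 3.0, 4.0, 5.0, 6.0],
})
toy_test = pd.DataFrame({'cat_demo': ['A', 'B', 'C']})

# Mean outcome per level, learned from the training rows only
means = toy_train.groupby('cat_demo')['log_loss'].mean()

# Replace each level with its training mean; unseen levels become NaN
toy_train['cat_demo_encoded'] = toy_train['cat_demo'].map(means)
toy_test['cat_demo_encoded'] = toy_test['cat_demo'].map(means)

print(toy_test)
```

'A' becomes 2.0 and 'B' becomes 5.0, while 'C' has no training mean to look up and ends up missing.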

There's just one more thing to check on. Linear Regression generally doesn't handle missing values very well. Let's see if we have any:

In [None]:
counts = train.count()
len(counts[counts < train.shape[0]])

Not in the training dataset. Let's check `test` now:

In [None]:
counts = test.count()
len(counts[counts < test.shape[0]])

Rats. OK, rather than design an elaborate solution, I'm just going to drop any columns with missing values.

In [None]:
# Drop columns (axis=1) that contain any missing value; recent pandas requires the keyword
temp = test.dropna(axis=1)
counts = temp.count()
len(counts[counts < temp.shape[0]])

Cool. Now, we're ready to make a model. 

In [None]:
model = smf.ols('log_loss ~ ' + ' + '.join(temp.columns), data=train).fit()
model.summary()

There's a lot of useful information here. However, since this is a prediction challenge, I'm not interested in most of it. Instead, I'm interested in how well it can predict new values. To do that...

In [None]:
yhat = np.exp(model.predict(test))

Note that we call `np.exp` on our model predictions. Remember how we log-transformed 'loss' up at the beginning of this script? Exponentiating the predictions undoes that, so they will be on the same scale as 'loss' instead of 'log_loss'. Forgetting this step is a really good way to get a terrible score.
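
As a quick sanity check of that round trip, here is a sketch with made-up loss values (not competition data). It shows how far apart the two scales are.

```python
import numpy as np

# Made-up losses spanning a few orders of magnitude
loss = np.array([100.0, 1_000.0, 10_000.0])

log_loss = np.log(loss)      # the scale the model is trained on
restored = np.exp(log_loss)  # back on the original 'loss' scale

print(log_loss)
print(restored)
```

Every value on the log scale here is below 10, so submitting un-exponentiated predictions would be off by orders of magnitude.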

Now that we have some predictions, let's write them out and score them!

In [None]:
result = pd.DataFrame({'id': test['id'].values, 'loss': yhat})
result = result.set_index('id')
result.to_csv('simplelmprediction.csv', index=True, index_label='id')

If you submit that, it should give you a score something like 1245.99. That's a bit worse than the Random Forest Benchmark (which isn't surprising). Onward to greater refinements!

Good luck!