<a href="https://colab.research.google.com/github/quinn-dougherty/AB-Demo/blob/master/module4-ridge-regression/ridge_regression_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##  Resources & stretch goals:
- https://www.quora.com/What-is-regularization-in-machine-learning
- https://blogs.sas.com/content/subconsciousmusings/2017/07/06/how-to-use-regularization-to-prevent-model-overfitting/
- https://machinelearningmastery.com/introduction-to-regularization-to-reduce-overfitting-and-improve-generalization-error/
- https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b
- https://stats.stackexchange.com/questions/111017/question-about-standardizing-in-ridge-regression#111022

Stretch goals:
- Revisit past data you've fit OLS models to, and see if there's an `alpha` such that ridge regression results in a model with lower MSE on a train/test split
- Yes, Ridge can be applied to classification! Check out [sklearn.linear_model.RidgeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html#sklearn.linear_model.RidgeClassifier), and try it on a problem you previously approached with a different classifier (note - scikit-learn's `LogisticRegression` also applies an $L^2$ penalty by default, so the difference won't be as dramatic)
- Implement your own function to calculate the full cost that ridge regression is optimizing (the sum of squared residuals + `alpha` times the sum of squared coefficients). This alone won't fit a model, but you can use it to check the cost of trained models and to verify that the coefficients from the equivalent OLS fit (without regularization) may have a higher cost
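
The cost described in the last stretch goal can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function name `ridge_cost` and the toy arrays are made up for the example, and the intercept is assumed to be excluded from the penalty.

```python
import numpy as np


def ridge_cost(X, y, coef, intercept=0.0, alpha=1.0):
    """Sum of squared residuals plus alpha times the sum of squared coefficients.

    Assumption: the intercept is not penalized.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    coef = np.asarray(coef, dtype=float).ravel()
    residuals = y - (X @ coef + intercept)
    return float(residuals @ residuals + alpha * (coef @ coef))


# Tiny hand-checkable example (made-up numbers, not the blog data).
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_demo = np.array([1.0, 2.0, 3.0])

# Predictions are [1, 1, 2], so the residuals are [0, 1, 1]:
# squared residuals sum to 2, and the penalty is 0.5 * (1 + 1) = 1.
cost_demo = ridge_cost(X_demo, y_demo, coef=[1.0, 1.0], alpha=0.5)
print(cost_demo)
```

For a fitted scikit-learn model `m`, something like `ridge_cost(X, y, m.coef_, m.intercept_, alpha=m.alpha)` should give the corresponding cost on `X` and `y`.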

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy.testing import assert_almost_equal

from sklearn import linear_model
from sklearn.preprocessing import StandardScaler, scale
from sklearn.linear_model import LinearRegression, Ridge, LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import altair as alt


# Assignment

Following is data describing characteristics of blog posts, with a target feature of how many comments will be posted in the following 24 hours.

https://archive.ics.uci.edu/ml/datasets/BlogFeedback

Investigate - you can try both linear and ridge regression. You can also sample down to a smaller data size and see whether that makes ridge more important. Don't forget to scale!

Focus on the training data, but if you want to load and compare to any of the test data files you can also do that.

Note - Ridge may not be that fundamentally superior in this case. That's OK! It's still good to practice both, and see if you can find parameters or sample sizes where ridge does generalize and perform better.
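
One way to explore the sample-size question is sketched below. It uses synthetic data rather than the blog files, so every name here (`make_data`, `holdout_mse`, the coefficient values, `alpha=10.0`) is an assumption chosen for illustration. The scaler is fit on the training rows only and then reused for the holdout rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the blog data: 40 features, only 5 informative.
n_features = 40
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]


def make_data(n_rows):
    """Draw Gaussian features and a noisy linear target."""
    X = rng.normal(size=(n_rows, n_features))
    y = X @ true_coef + rng.normal(scale=2.0, size=n_rows)
    return X, y


X_pool, y_pool = make_data(2000)
X_holdout, y_holdout = make_data(1000)


def holdout_mse(model, n_rows):
    """Fit on the first n_rows of the pool; score on the fixed holdout set."""
    X_sub, y_sub = X_pool[:n_rows], y_pool[:n_rows]
    scaler = StandardScaler().fit(X_sub)  # statistics from training rows only
    model.fit(scaler.transform(X_sub), y_sub)
    preds = model.predict(scaler.transform(X_holdout))
    return mean_squared_error(y_holdout, preds)


# Map each training size to (OLS holdout MSE, ridge holdout MSE).
results = {
    n: (holdout_mse(LinearRegression(), n), holdout_mse(Ridge(alpha=10.0), n))
    for n in (50, 200, 2000)
}
for n, (mse_ols, mse_ridge) in results.items():
    print(f"n={n:5d}  OLS MSE={mse_ols:8.2f}  Ridge MSE={mse_ridge:8.2f}")
```

The same loop could be pointed at subsamples of the real training frame. Regularization tends to matter most when the number of rows is small relative to the number of features, though that is not guaranteed for any particular dataset.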

When you've fit models to your satisfaction, answer the following question:

```
Did you find cases where Ridge performed better? If so, describe (alpha parameter, sample size, any other relevant info/processing). If not, what do you think that tells you about the data?
```

You can create whatever plots, tables, or other results support your argument. In this case, your target audience is a fellow data scientist, *not* a layperson, so feel free to dig in!


In [6]:
'''
Data Set Information:

This data originates from blog posts. The raw HTML-documents
of the blog posts were crawled and processed.
The prediction task associated with the data is the prediction
of the number of comments in the upcoming 24 hours. In order
to simulate this situation, we choose a basetime (in the past)
and select the blog posts that were published at most
72 hours before the selected base date/time. Then, we calculate
all the features of the selected blog posts from the information
that was available at the basetime, therefore each instance
corresponds to a blog post. The target is the number of
comments that the blog post received in the next 24 hours
relative to the basetime.

In the train data, the basetimes were in the years
2010 and 2011. In the test data the basetimes were
in February and March 2012. This simulates the real-world
situation in which training data from the past is available
to predict events in the future.

The train data was generated from different basetimes that may
temporally overlap. Therefore, if you simply split the train
into disjoint partitions, the underlying time intervals may
overlap. Therefore, you should use the provided, temporally
disjoint train and test splits in order to ensure that the
evaluation is fair.

Attribute Information:

1...50:
Average, standard deviation, min, max and median of the
Attributes 51...60 for the source of the current blog post
With source we mean the blog on which the post appeared.
For example, myblog.blog.org would be the source of
the post myblog.blog.org/post_2010_09_10
51: Total number of comments before basetime
52: Number of comments in the last 24 hours before the
basetime
53: Let T1 denote the datetime 48 hours before basetime,
Let T2 denote the datetime 24 hours before basetime.
This attribute is the number of comments in the time period
between T1 and T2
54: Number of comments in the first 24 hours after the
publication of the blog post, but before basetime
55: The difference of Attribute 52 and Attribute 53
56...60:
The same features as the attributes 51...55, but
features 56...60 refer to the number of links (trackbacks),
while features 51...55 refer to the number of comments.
61: The length of time between the publication of the blog post
and basetime
62: The length of the blog post
63...262:
The 200 bag of words features for 200 frequent words of the
text of the blog post
263...269: binary indicator features (0 or 1) for the weekday
(Monday...Sunday) of the basetime
270...276: binary indicator features (0 or 1) for the weekday
(Monday...Sunday) of the date of publication of the blog
post
277: Number of parent pages: we consider a blog post P as a
parent of blog post B, if B is a reply (trackback) to
blog post P.
278...280:
Minimum, maximum, average number of comments that the
parents received
281: The target: the number of comments in the next 24 hours
(relative to basetime)'''

zipurl = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00304/BlogFeedback.zip'

# Download the archive; IPython expands {zipurl} inside shell commands.
!wget {zipurl}

# Unpack the CSV files into the working directory.
!unzip BlogFeedback.zip
#!ls

#!cat blogData_test-2012.02.01.00_00.csv

--2019-01-25 01:56:02--  https://archive.ics.uci.edu/ml/machine-learning-databases/00304/BlogFeedback.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2583605 (2.5M) [application/zip]
Saving to: ‘BlogFeedback.zip.1’


2019-01-25 01:56:03 (3.59 MB/s) - ‘BlogFeedback.zip.1’ saved [2583605/2583605]

FINISHED --2019-01-25 01:56:03--
Total wall clock time: 0.8s
Downloaded: 1 files, 2.5M in 0.7s (3.59 MB/s)
Archive:  BlogFeedback.zip
  inflating: blogData_test-2012.02.01.00_00.csv  
  inflating: blogData_test-2012.02.02.00_00.csv  
  inflating: blogData_test-2012.02.03.00_00.csv  
  inflating: blogData_test-2012.02.04.00_00.csv  
  inflating: blogData_test-2012.02.05.00_00.csv  
  

In [0]:
train_url = 'blogData_train.csv'

test1url='blogData_test-2012.02.01.00_00.csv'
test2url='blogData_test-2012.03.31.01_00.csv'

train_ = pd.read_csv(train_url, header=None)
df1_ = pd.read_csv(test1url, header=None)
df2_ = pd.read_csv(test2url, header=None)


In [8]:
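def center_and_unitvari(dat):
  """Standardize every column to zero mean and unit variance.

  Note: a new StandardScaler is fit on each frame passed in (target column
  included), so train and test files are each scaled with their own statistics.
  """
  sc = StandardScaler()
  dat_sc = sc.fit_transform(dat)

  newdat = pd.DataFrame(dat_sc)
  newdat.columns = newdat.columns + 1  # 1-based labels, matching the attribute list
  assert_almost_equal(newdat.mean().values, 0)
  # assert_almost_equal(newdat.std().values, 1, 2)  # disabled; would fail for any constant column, whose std stays 0
  return newdat

train = center_and_unitvari(train_)
df1 = center_and_unitvari(df1_)
df2 = center_and_unitvari(df2_)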

assert all([x==0 for x in train.isna().sum().values + df1.isna().sum().values + df2.isna().sum().values])

assert all([all(train.apply(lambda x: np.issubdtype(x, np.number)).values), 
            all(df1.apply(lambda x: np.issubdtype(x, np.number)).values),
            all(df2.apply(lambda x: np.issubdtype(x, np.number)).values),])

train.head()

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 272 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.010876 | 0.112877 | -0.052468 | 0.138521 | -0.139108 | 0.009598 | 0.116182 | -0.020836 | 0.368246 | -0.119031 | ... | -0.454696 | 2.272362 | -0.427399 | -0.326158 | -0.312402 | -0.08286 | 0.0 | -0.045171 | -0.037836 | -0.152885 |
| 1 | 0.010876 | 0.112877 | -0.052468 | 0.138521 | -0.139108 | 0.009598 | 0.116182 | -0.020836 | 0.368246 | -0.119031 | ... | 2.199274 | -0.440071 | -0.427399 | -0.326158 | -0.312402 | -0.08286 | 0.0 | -0.045171 | -0.037836 | -0.179406 |
| 2 | 0.010876 | 0.112877 | -0.052468 | 0.138521 | -0.139108 | 0.009598 | 0.116182 | -0.020836 | 0.368246 | -0.119031 | ... | 2.199274 | -0.440071 | -0.427399 | -0.326158 | -0.312402 | -0.08286 | 0.0 | -0.045171 | -0.037836 | -0.179406 |
| 3 | 0.010876 | 0.112877 | -0.052468 | 0.138521 | -0.139108 | 0.009598 | 0.116182 | -0.020836 | 0.368246 | -0.119031 | ... | -0.454696 | 2.272362 | -0.427399 | -0.326158 | -0.312402 | -0.08286 | 0.0 | -0.045171 | -0.037836 | -0.152885 |
| 4 | 0.010876 | 0.112877 | -0.052468 | 0.138521 | -0.139108 | 0.009598 | 0.116182 | -0.020836 | 0.368246 | -0.119031 | ... | -0.454696 | 2.272362 | -0.427399 | -0.326158 | -0.312402 | -0.08286 | 0.0 | -0.045171 | -0.037836 | 0.536657 |
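
`center_and_unitvari` fits a fresh scaler on every file. A common alternative, sketched here on made-up arrays rather than the blog data, is to learn the mean and standard deviation from the training rows only and reuse them for the test rows. The names `toy_train` and `toy_test` are hypothetical, and switching to this scheme would change the scores reported in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays standing in for the train and test feature matrices (hypothetical values).
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
toy_test = np.array([[2.0, 20.0], [6.0, 60.0]])

scaler = StandardScaler().fit(toy_train)   # learn mean/std from train only
train_scaled = scaler.transform(toy_train)
test_scaled = scaler.transform(toy_test)   # reuse the training statistics

# Train columns are centered by construction; test columns generally are not.
print(train_scaled.mean(axis=0))
print(test_scaled.mean(axis=0))
```

With this approach the test set never influences the preprocessing, which is closer to how a model would be applied to genuinely unseen posts.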


In [9]:
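dependent = 281  # column label of the target (comments in the next 24 hours)
X_train = train.drop(dependent, axis=1)
# Go through .values first: indexing a Series with [:, np.newaxis] fails on recent pandas.
y_train = train[dependent].values[:, np.newaxis]
X_test = df1.drop(dependent, axis=1)
y_test = df1[dependent].values[:, np.newaxis]


m = Ridge().fit(X_train,y_train)  # default alpha=1.0

#m.coef_, m.intercept_, m.n_iter_

m.score(X_test, y_test)  # R^2 on the first test file

#m.predict(X_test)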



0.7510771490709747

In [10]:
# Sweep alpha from about 0.018 to about 9.94 in steps of 0.16 (the exp(-4) offset keeps alpha > 0).
alphas = np.arange(0, 1000, 2**4) / 100 + np.exp(-4)
scores = {k: Ridge(alpha=k).fit(X_train, y_train).score(X_test, y_test) for k in alphas}

scores_df = pd.DataFrame.from_dict(scores, orient='index').reset_index()
scores_df.columns = ['alpha', 'R2']

# Scatter plot of test-set R^2 against alpha.
C = alt.Chart(scores_df).mark_circle().encode(x='alpha', y='R2')

C

In [0]:
# alpha doesn't seem to matter over this range.

In [11]:
ols = LinearRegression()
# that's nuts: a hugely negative R^2 means OLS predicts far worse than just guessing the test mean.
ols.fit(X_train,y_train).score(X_test,y_test)

-2.2448258545033072e+20
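
One possible contributor to the OLS score above (an assumption, not verified on the blog data) is exact collinearity: the attribute list says attribute 55 is the difference of attributes 52 and 53. The sketch below uses made-up count-like columns to show that such a dependency makes $X^T X$ numerically singular, while adding `alpha` times the identity, as ridge does, restores a well-conditioned system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up count-like columns plus their difference, mimicking attributes 52, 53 and 55.
a = rng.poisson(5.0, size=200).astype(float)
b = rng.poisson(3.0, size=200).astype(float)
X = np.column_stack([a, b, a - b])
X = X - X.mean(axis=0)  # center the columns

gram = X.T @ X                                  # matrix OLS has to invert
alpha = 1.0
ridge_gram = gram + alpha * np.eye(X.shape[1])  # matrix ridge inverts instead

rank_x = np.linalg.matrix_rank(X)        # one of the three columns is redundant
cond_ols = np.linalg.cond(gram)          # very large: numerically singular
cond_ridge = np.linalg.cond(ridge_gram)  # far smaller once alpha is added
print(rank_x, f"{cond_ols:.3g}", f"{cond_ridge:.3g}")
```

When the Gram matrix is this ill-conditioned, OLS coefficients along the redundant direction are essentially arbitrary and can be enormous, which is consistent with (though not proof of) the extreme test score seen here.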