# Feature and model selection - UCI ML BlogFeedback dataset

## Overview

This project performs feature and model selection on the UCI machine learning `BlogFeedback` dataset. The methods used are ridge, lasso, and elastic net regression. Based on the performance metrics, 42 of the 280 features are selected via lasso regression.

## Data description

The dataset includes 281 variables (280 features and 1 target variable).

This data originates from blog posts. The prediction task is to predict the number of comments a post receives in the upcoming 24 hours. To simulate this situation, a basetime (in the past) is chosen, and blog posts published at most 72 hours before the selected base date/time are selected. All features of the selected blog posts are then calculated from the information available at the basetime, and each instance corresponds to a blog post. The target is the number of comments the blog post received in the 24 hours after the basetime.

In the training data, the basetimes fall in 2010 and 2011. In the test data, the basetimes fall in February and March 2012, and the test data are combined into a single test set, `test.csv`.
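As a minimal sketch of how the 281 variables might be separated into features and target, the snippet below assumes the target is the last column (index 280), which matches the UCI attribute description. The `demo` frame is synthetic stand-in data, not the real CSV.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training CSV: 281 unnamed columns,
# mimicking what pd.read_csv(..., header=None) produces.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((5, 281)))

# Assumption: columns 0..279 are features, column 280 is the target.
X = demo.iloc[:, :-1]
y = demo.iloc[:, -1]

print(X.shape, y.shape)
```

The same two `iloc` slices would apply to the real `train` frame once it is loaded.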

## EDA

First, we load the required libraries, read in the training set, and do some simple exploration.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.metrics import accuracy_score, log_loss
%matplotlib inline

In [5]:
train = pd.read_csv("../data/train/blogData_train.csv", header=None)
train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,271,272,273,274,275,276,277,278,279,280
0,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.0


### Check for NA values in the train data

Let's check whether we need to deal with any missing values. The result shows that there are no missing values in the training set.

In [10]:
print("Number of missing values is",train.isna().sum().sum())

Number of missing values is 0


### Summary statistics of the data

The following table gives the summary statistics of each column of the training set.

In [6]:
train.describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 52397.0 | 52397.0 | 52397.0 | 52397.0 | 52397.0 | 52397.0 | 52397.0 | 52397.0 | 52397.0 | 52397.0 | ... | 52397.0 | 52397.0 | 52397.0 | 52397.0 | 52397.0 | 52397.0 | 52397.0 | 52397.0 | 52397.0 | 52397.0 |
| mean | 39.444167 | 46.806717 | 0.358914 | 339.853102 | 24.681661 | 15.214611 | 27.959159 | 0.002748 | 258.66603 | 5.829151 | ... | 0.171327 | 0.162242 | 0.154455 | 0.096151 | 0.088917 | 0.119167 | 0.0 | 1.242094 | 0.769505 | 6.764719 |
| std | 79.121821 | 62.359996 | 6.840717 | 441.430109 | 69.598976 | 32.251189 | 38.584013 | 0.131903 | 321.348052 | 23.768317 | ... | 0.376798 | 0.368676 | 0.361388 | 0.2948 | 0.284627 | 1.438194 | 0.0 | 27.497979 | 20.338052 | 37.706565 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.285714 | 5.214318 | 0.0 | 29.0 | 0.0 | 0.891566 | 3.075076 | 0.0 | 22.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 10.63066 | 19.35312 | 0.0 | 162.0 | 4.0 | 4.150685 | 11.051215 | 0.0 | 121.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 40.30467 | 77.44283 | 0.0 | 478.0 | 15.0 | 15.998589 | 45.701206 | 0.0 | 387.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 1122.6666 | 559.4326 | 726.0 | 2044.0 | 1314.0 | 442.66666 | 359.53006 | 14.0 | 1424.0 | 588.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 136.0 | 0.0 | 1778.0 | 1778.0 | 1424.0 |


Below is a PCA plot of the training set. It shows no distinct groups or patterns.

![](../results/pca.png)
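A two-component projection like the one plotted above could be computed roughly as follows. This is a sketch on synthetic data. Standardizing the features before PCA is an assumption here, since the original preprocessing is not stated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (the real one has 280 columns).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))

# Assumption: features are standardized first so that large-scale
# columns do not dominate the principal components.
X_scaled = StandardScaler().fit_transform(X_demo)

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

print(components.shape)
print(pca.explained_variance_ratio_)
```

Scattering `components[:, 0]` against `components[:, 1]` with `matplotlib` would then give the 2-D view.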

## Feature selection

Here we perform feature selection using three regression models: Ridge, Lasso, and Elastic Net. The results are summarized in the following table.

### Summary Table

| Method | Features selected | R-squared | MSE |
| ---| --- | --- |---| 
| Ridge |  276 | 0.3591|911.06|
| Lasso | 42 | 0.3595| 910.62|
| Elastic Net| 76| 0.3600 | 909.88|

All three models give similar `R-squared` and `MSE` values, so they fit the data about equally well. However, I prefer models with fewer features, which here are Lasso and Elastic Net, so I evaluate these two on the test set.
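A comparison of this kind can be sketched as below. It assumes a feature counts as "selected" when its fitted coefficient is nonzero, and it uses the cross-validated estimators with illustrative settings (`alphas`, `l1_ratio`, `cv`). None of these choices are confirmed as the ones behind the table, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression problem: 30 features, only 5 informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=30, n_informative=5, noise=10.0, random_state=0
)

# Hyperparameter grids and CV settings are illustrative assumptions.
models = {
    "Ridge": RidgeCV(alphas=np.logspace(-3, 3, 13)),
    "Lasso": LassoCV(cv=5, random_state=0),
    "Elastic Net": ElasticNetCV(l1_ratio=0.5, cv=5, random_state=0),
}

summary = {}
for name, model in models.items():
    model.fit(X_demo, y_demo)
    pred = model.predict(X_demo)
    # Assumption: "selected" means the coefficient is not exactly zero.
    n_selected = int(np.count_nonzero(model.coef_))
    summary[name] = (
        n_selected,
        r2_score(y_demo, pred),
        mean_squared_error(y_demo, pred),
    )
    print(name, summary[name])
```

Ridge shrinks coefficients without setting them exactly to zero, which is why it typically keeps nearly every feature, while the L1 penalty in Lasso and Elastic Net removes features outright.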

## Test on test set

### Summary Table based on test set

We then evaluate the Lasso and Elastic Net models on the test set. The results are summarized below.

| Method | Features selected | R-squared | MSE |
| ---| --- | --- |---| 
| Lasso | 42 | 0.3145| 637.66|
| Elastic Net| 76| 0.3137 | 638.34|

On the test set, the two models fit almost equally well. Lasso achieves a slightly higher `R-squared` and a slightly lower `MSE` while requiring fewer features. Thus, I select the relevant features based on the Lasso model.
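The held-out evaluation can be illustrated with the following sketch. It uses synthetic data and a random `train_test_split` as a stand-in for the separate train and test files, so the numbers it prints are unrelated to the table above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data; the random split stands in for the separate test file.
X_demo, y_demo = make_regression(
    n_samples=400, n_features=30, n_informative=5, noise=10.0, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# Fit on the training portion only, then score on unseen rows.
lasso = LassoCV(cv=5, random_state=1).fit(X_train, y_train)
test_pred = lasso.predict(X_test)

test_r2 = r2_score(y_test, test_pred)
test_mse = mean_squared_error(y_test, test_pred)
print(test_r2, test_mse)
```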

The indexes of the selected features are shown below; they can be looked up in the [data-attribute-description](../data/data-attribute-description.md) file.

![](../results/selected.png)
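One way to obtain such a list of indexes from a fitted Lasso model is to take the positions of its nonzero coefficients. The sketch below uses synthetic data and an arbitrary `alpha`, so it illustrates the mechanism and does not reproduce the 42 features above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data with 20 features, 4 of them informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=4, noise=5.0, random_state=0
)

# alpha=1.0 is an arbitrary illustrative penalty strength.
lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)

# Column indexes whose coefficients were not shrunk to exactly zero.
selected_idx = np.flatnonzero(lasso.coef_)

# Reduced feature matrix for any follow-up model.
X_selected = X_demo[:, selected_idx]
print(selected_idx)
```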

Since the performance scores of the current models are not good, future analysis could apply other methods, such as random forest or SVM, to the features selected above.