#### Hi, welcome to my project! Today we will be using logistic regression classifiers to predict human activities.
#### We will use the Human Activity Recognition with Smartphones database, which was built from recordings of study participants performing activities of daily living (ADL) while carrying a smartphone with embedded inertial sensors. The objective is to classify each record into one of six activities (walking, walking upstairs, walking downstairs, sitting, standing, and laying).

Each record in the dataset provides:

- Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration.
- Triaxial angular velocity from the gyroscope.
- A 561-feature vector with time and frequency domain variables.
- Its activity label.

In [None]:
import seaborn as sns, pandas as pd, numpy as np

## Reading our csv file:

In [None]:
filepath = '../input/logistic-regression/Human_Activity_Recognition_Using_Smartphones_Data.csv'
data = pd.read_csv(filepath, sep=',')

In [None]:
data

## Exploring our data:

In [None]:
data.dtypes.value_counts()

The data columns are all floats except for the activity label.

In [None]:
data.iloc[:,:-1].min().value_counts()

In [None]:
data.iloc[:,:-1].max().value_counts()

The data are all scaled from -1 (minimum) to 1.0 (maximum).

Examine the breakdown of activities; they are relatively balanced.

In [None]:
data.Activity.value_counts()

In [None]:
data.Activity.value_counts(normalize=True)

Scikit-learn classifiers won't accept a sparse matrix for the target column. Thus, either LabelEncoder needs to be used to convert the activity labels to integers, or, if DictVectorizer is used, the resulting matrix must be converted to a non-sparse array.
We are going to use LabelEncoder to fit_transform the "Activity" column, and look at 5 random values.

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data['Activity'] = le.fit_transform(data.Activity)
data['Activity'].sample(5)

In [None]:
le.inverse_transform([0,1,2,3,4,5])      # Only to know which one corresponds to each number

In [None]:
le.transform(['STANDING'])
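
To make the label-to-integer mapping concrete, here is a minimal sketch on a handful of toy labels (the list `labels` and the name `le_demo` are illustrative, not part of the dataset). It shows that LabelEncoder sorts the unique labels and uses each label's position in that sorted list as its code.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the "Activity" column (illustrative only)
labels = ['WALKING', 'SITTING', 'STANDING', 'SITTING', 'LAYING']

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# classes_ holds the sorted unique labels; each code is an index into it
print(list(le_demo.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original strings
print(list(le_demo.inverse_transform(codes)))
```

Because the codes follow alphabetical order, the same six activity names always map to the same integers.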

### Let's calculate the correlations between each pair of feature columns:

In [None]:
feature_cols = data.columns[:-1]
corr_values = data[feature_cols].corr()

# Simplify by emptying all the data below the diagonal
tril_index = np.tril_indices_from(corr_values)

# Make the unused values NaNs (np.nan is the spelling supported across NumPy versions)
for coord in zip(*tril_index):
    corr_values.iloc[coord[0], coord[1]] = np.nan
    
# Stack the data, drop the NaN cells explicitly, and convert to a data frame
corr_values = (corr_values
               .stack()
               .dropna()
               .to_frame()
               .reset_index()
               .rename(columns={'level_0':'feature1',
                                'level_1':'feature2',
                                0:'correlation'}))

# Get the absolute values for sorting
corr_values['abs_correlation'] = corr_values.correlation.abs()

In [None]:
len(corr_values)   # Number of rows (unique feature pairs) is n(n-1)/2 for n features
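
The same upper-triangle extraction can also be written without a Python loop. The following is a sketch under the assumption that a small random frame (`toy`, a made-up stand-in for the feature columns) behaves like the real data; it masks everything on or below the diagonal and checks the n(n-1)/2 pair count.

```python
import numpy as np
import pandas as pd

# Small random frame standing in for the feature columns (illustrative only)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 5)), columns=list('abcde'))
corr = toy.corr()

# Boolean mask that is True strictly above the diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# where() keeps the masked-in cells and sets the rest to NaN;
# dropna() removes those NaN cells after stacking
pairs = (corr.where(upper)
             .stack()
             .dropna()
             .rename_axis(['feature1', 'feature2'])
             .reset_index(name='correlation'))

n = corr.shape[0]
print(len(pairs), n * (n - 1) // 2)
```

For the 561 feature columns described above, the same formula gives 561 * 560 / 2 = 157,080 pairs.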

A histogram of the absolute value correlations:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
sns.set_context('talk')
sns.set_style('white')

ax = corr_values.abs_correlation.hist(bins=50, figsize=(12, 8))
ax.set(xlabel='Absolute Correlation', ylabel='Frequency');

In [None]:
# The most highly correlated values
corr_values.sort_values('correlation', ascending=False).query('abs_correlation>0.9')

# Splitting our dataset:
This can be done using any method, but for this project we will use Scikit-learn's StratifiedShuffleSplit to maintain the same ratio of target classes in the train and test sets.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

# Get the split indexes
strat_shuf_split = StratifiedShuffleSplit(n_splits=1, 
                                          test_size=0.3, 
                                          random_state=42)

train_idx, test_idx = next(strat_shuf_split.split(data[feature_cols], data.Activity))

# Create the dataframes
X_train = data.loc[train_idx, feature_cols]
y_train = data.loc[train_idx, 'Activity']

X_test  = data.loc[test_idx, feature_cols]
y_test  = data.loc[test_idx, 'Activity']

Let's look at the class ratios in y_test to compare them with the original dataset:

In [None]:
y_test.value_counts(normalize=True)
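
To illustrate what stratification does, here is a small sketch on deliberately imbalanced toy labels (`X_toy` and `y_toy` are made up for the example). It compares the class proportions of the full set with those of the train and test parts.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Imbalanced toy labels: 60% class 0, 30% class 1, 10% class 2 (illustrative only)
y_toy = np.array([0] * 60 + [1] * 30 + [2] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_i, test_i = next(splitter.split(X_toy, y_toy))

# Class proportions in the full set and in each part of the split
full_ratio = np.bincount(y_toy) / len(y_toy)
train_ratio = np.bincount(y_toy[train_i]) / len(train_i)
test_ratio = np.bincount(y_toy[test_i]) / len(test_i)
print(full_ratio, train_ratio, test_ratio)
```

The three proportion vectors should agree closely, which is the property we rely on when comparing y_test with the original dataset.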

# Building Logistic Regression models:

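#### First, let's fit a baseline logistic regression model with scikit-learn's default settings using all of the features. Note that the default applies an L2 penalty with C=1.0, so this baseline is not strictly unregularized.

#### Then we will use cross-validation to choose the regularization strength for models with L1 and L2 penalties.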

In [None]:
from sklearn.linear_model import LogisticRegression 

# Baseline logistic regression (scikit-learn defaults: L2 penalty, C=1.0)
lr = LogisticRegression(solver='liblinear').fit(X_train,y_train)

In [None]:
from sklearn.linear_model import LogisticRegressionCV

# L1 regularized logistic regression
lr_l1 = LogisticRegressionCV(Cs=10, cv=4, penalty='l1', solver='liblinear').fit(X_train, y_train)

In [None]:
# L2 regularized logistic regression
lr_l2 = LogisticRegressionCV(Cs=10, cv=4, penalty='l2', solver='liblinear').fit(X_train, y_train)
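
When `Cs` is an integer, scikit-learn documents it as a grid of that many values on a log scale between 1e-4 and 1e4. The sketch below, which assumes a synthetic binary problem from `make_classification` and default solver settings (not the notebook's data or solver), shows how to inspect the candidate grid and the value that cross-validation selected.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic binary problem standing in for the real data (illustrative only)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

cv_model = LogisticRegressionCV(Cs=10, cv=4, max_iter=1000).fit(X_toy, y_toy)

# Candidate C values tried during cross-validation
print(cv_model.Cs_)

# C value(s) selected by cross-validation
print(cv_model.C_)
```

The same attributes can be read from `lr_l1` and `lr_l2` to see which regularization strength each penalty ended up with.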

### Comparing the magnitudes of the coefficients for each model. As one-vs-rest fitting was used, each set of coefficients can be plotted separately.

To get an idea of what this looks like, let's view the coefficients of the "lr" model, transposed:

In [None]:
pd.DataFrame(lr.coef_).T 

In [None]:
# Now we are going to combine all the coefficients into a dataframe
coefficients = list()

coeff_labels = ['lr', 'l1', 'l2']
coeff_models = [lr, lr_l1, lr_l2]

for lab,mod in zip(coeff_labels, coeff_models):
    coeffs = mod.coef_
    coeff_label = pd.MultiIndex(levels=[[lab], [0,1,2,3,4,5]], 
                                 codes=[[0,0,0,0,0,0], [0,1,2,3,4,5]])
    coefficients.append(pd.DataFrame(coeffs.T, columns=coeff_label))

coefficients = pd.concat(coefficients, axis=1)

coefficients.head(10)

As we know from logistic regression, when the label is multiclass the model fits one set of coefficients per class, because of the one-vs-rest method. Thus, each model contains 6 sets of coefficients, which differ from one another.

In the following step I will focus on one class at a time: I will plot the coefficient sets obtained for the first class, coloring the points by model, then do the same for the second class, and so on.
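
As a small illustration of the "one set of coefficients per class" idea, the sketch below makes the one-vs-rest scheme explicit with scikit-learn's `OneVsRestClassifier`. This wrapper is an assumption for the example (the models in this notebook are fit directly), and the iris data stands in for the activity data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Iris (3 classes, 4 features) stands in for the activity data (illustrative only)
X_toy, y_toy = load_iris(return_X_y=True)

# One binary "this class vs. all others" model is fit per class
ovr = OneVsRestClassifier(LogisticRegression(solver='liblinear')).fit(X_toy, y_toy)

# Each binary model has a (1, n_features) coefficient array; stack them by class
coef_sets = np.vstack([est.coef_ for est in ovr.estimators_])
print(coef_sets.shape)
```

With 6 activities and 561 features, the analogous array has one row of 561 coefficients for each of the 6 classes.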

### Displaying six separate plots, one for each set of multi-class coefficients:

In [None]:
fig, axList = plt.subplots(nrows=3, ncols=2)
axList = axList.flatten()
fig.set_size_inches(12,12)

for loc, ax in enumerate(axList):
    # Use a separate name so the original `data` frame is not overwritten
    coef_set = coefficients.xs(loc, level=1, axis=1)
    coef_set.plot(marker='o', ls='', ms=2.0, ax=ax, legend=False)
    
    if ax is axList[0]:
        ax.legend(loc=4)
        
    ax.set(title='Coefficient Set '+str(loc))

plt.tight_layout()

### Predict and store the class for each model.
Store the probability for the predicted class for each model.

In [None]:
# Predict the class and the probability for each
y_pred = list()
y_prob = list()

coeff_labels = ['lr', 'l1', 'l2']
coeff_models = [lr, lr_l1, lr_l2]

for lab,mod in zip(coeff_labels, coeff_models):
    y_pred.append(pd.Series(mod.predict(X_test), name=lab))
    y_prob.append(pd.Series(mod.predict_proba(X_test).max(axis=1), name=lab))
    
y_pred = pd.concat(y_pred, axis=1)
y_prob = pd.concat(y_prob, axis=1)

y_pred.head()

In [None]:
y_prob.head()
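
The relationship between `predict` and `predict_proba` used above can be checked on a small example. This sketch assumes the iris data and a default LogisticRegression as stand-ins; it confirms that each row of probabilities sums to 1 and that the class with the highest probability is the predicted class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris stands in for the activity data (illustrative only)
X_toy, y_toy = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)                  # shape (n_samples, n_classes)
top_prob = proba.max(axis=1)                      # probability of the predicted class
top_class = clf.classes_[proba.argmax(axis=1)]    # class with the highest probability

print(np.allclose(proba.sum(axis=1), 1.0))
print((top_class == clf.predict(X_toy)).all())
```

This is why taking `.max(axis=1)` of the probability matrix gives the model's confidence in the class it actually predicted.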

Below we display all rows where the "lr" and "l1" models predict different classes. The confusion matrices plotted later give a clearer view of this for all models.

In [None]:
y_pred[y_pred['lr']!=y_pred['l1']] 

# Computing error metrics:

We can examine the error metrics for the 3 models in detail using a classification report:

In [None]:
from sklearn.metrics import classification_report
print('Classification report for Logistic regression with default settings:')
print(classification_report(y_test,y_pred['lr']))

print('Classification report for Logistic regression with L1(Lasso) regularization:')
print(classification_report(y_test,y_pred['l1']))
    
print('Classification report for Logistic regression with L2(Ridge) regularization:')
print(classification_report(y_test,y_pred['l2']))
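
Each report ends with a "weighted avg" row. As a sketch of how that number is formed, the example below (toy labels `y_true` and `y_hat`, made up for illustration) weights each class's precision by its support and compares the result with scikit-learn's `average='weighted'`.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy true and predicted labels (illustrative only)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_hat = np.array([0, 0, 1, 0, 1, 1, 2, 2, 2, 0])

# Per-class precision and support (number of true samples in each class)
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_hat, zero_division=0)

# Weighted average: each class contributes in proportion to its support
manual_weighted = np.sum(prec * support) / np.sum(support)

sk_weighted, _, _, _ = precision_recall_fscore_support(
    y_true, y_hat, average='weighted', zero_division=0)
print(manual_weighted, sk_weighted)
```

Since the activities are fairly balanced here, the weighted and unweighted (macro) averages should be close to each other.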

In order to summarize and average the values obtained we will do the following:

In [None]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.preprocessing import label_binarize

metrics = list()
cm = dict()

for lab in coeff_labels:

    # Precision, recall, f-score from the multi-class support function;
    # average='weighted' collapses the per-class values into one number
    precision, recall, fscore, _ = score(y_test, y_pred[lab], average='weighted')
    
    # The usual way to calculate accuracy
    accuracy = accuracy_score(y_test, y_pred[lab])
    
    # ROC-AUC computed here from binarized hard predictions (not probabilities)
    auc = roc_auc_score(label_binarize(y_test, classes=[0,1,2,3,4,5]),
              label_binarize(y_pred[lab], classes=[0,1,2,3,4,5]), 
              average='weighted')
    
    # Last, the confusion matrix
    cm[lab] = confusion_matrix(y_test, y_pred[lab])
    
    metrics.append(pd.Series({'precision':precision, 'recall':recall, 
                              'fscore':fscore, 'accuracy':accuracy,
                              'auc':auc}, 
                             name=lab))

metrics = pd.concat(metrics, axis=1)

In [None]:
metrics
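
The AUC column above is computed from hard class predictions. An alternative, shown here only as a sketch on the iris data with a default LogisticRegression (both are stand-ins, not the notebook's models), is to pass the predicted probabilities to `roc_auc_score` with `multi_class='ovr'`, which ranks samples by confidence instead of by a 0/1 decision.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Iris stands in for the activity data (illustrative only)
X_toy, y_toy = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Pass the full probability matrix; 'ovr' scores each class against the rest
auc_from_proba = roc_auc_score(
    y_te, clf.predict_proba(X_te), multi_class='ovr', average='weighted')
print(auc_from_proba)
```

The two approaches answer slightly different questions, so their values are not directly comparable.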

**The table above shows no meaningful difference in the error metrics across the 3 models; even the first one (default settings) can be expected to predict the activities well.** To highlight the differences a bit more, we can plot the corresponding confusion matrices as follows:

# Displaying the confusion matrix for each model:

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
disp = ConfusionMatrixDisplay(confusion_matrix=cm['lr'], display_labels=lr.classes_)
disp.plot(cmap='Blues')

In [None]:
disp = ConfusionMatrixDisplay(confusion_matrix=cm['l1'], display_labels=lr.classes_)
disp.plot(cmap='Blues')

In [None]:
disp = ConfusionMatrixDisplay(confusion_matrix=cm['l2'], display_labels=lr.classes_)
disp.plot(cmap='Blues')

In [None]:
le.inverse_transform([1,2])

We can infer from the 3 confusion matrices that every model has a slight problem distinguishing classes 1 and 2, which correspond to the activities 'SITTING' and 'STANDING' respectively. We suppose this is due to the similarity of the angle or position measurements for these two activities.
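
The most-confused pair can also be read off a confusion matrix programmatically. The following sketch uses toy encoded labels (`y_true` and `y_hat` are made up so that classes 1 and 2 get mixed up); it zeroes the diagonal and looks for the largest remaining entry.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy encoded labels (illustrative only): classes 1 and 2 get mixed up
y_true = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y_hat = np.array([0, 0, 1, 1, 2, 2, 2, 2, 2, 1])

# Rows are true classes, columns are predicted classes
cm_toy = confusion_matrix(y_true, y_hat)

# Zero the diagonal so only the errors remain, then find the largest one
errors = cm_toy.copy()
np.fill_diagonal(errors, 0)
true_cls, pred_cls = np.unravel_index(errors.argmax(), errors.shape)
print(true_cls, pred_cls, errors[true_cls, pred_cls])
```

Applied to `cm['lr']`, `cm['l1']`, or `cm['l2']`, the resulting indices can be passed to `le.inverse_transform` to recover the activity names.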