
WORK PLAN

In this project:
* Import the necessary libraries
* Load and analyse the data
* Find correlations among the features
* Split the data into train and test data (validation data)
* Predict the activity using Logistic Regression and Logistic Regression CV
* Calculate the classification error metrics
* Feature selection to pick the best features for a better prediction
* Calculate the new classification error metrics
* Compare steps 6 and 8 above to get the best model
* Conclusion and submission

I am using the Kaggle data which can be found here:

https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones/downloads/human-activity-recognition-with-smartphones.zip  

## 1 - Importing the Necessary Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as error_metric
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.feature_selection import VarianceThreshold

## 2 - Load and analyse the data

In [None]:
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

In [None]:
train.head()

In [None]:
test.head()

In [None]:
#Check for null values
train.isnull().values.any()

In [None]:
test.isnull().values.any()

There are no null values in either the train or the test dataset.

The subject column is not going to be useful here, so I will drop it from both datasets.

In [None]:
train.drop('subject', axis =1, inplace=True)
test.drop('subject', axis =1, inplace=True)

In [None]:
train.head()

In [None]:
rem_cols2 = test.columns.tolist()

In [None]:
# We check the datatypes 
train.dtypes.value_counts()

In [None]:
test.dtypes.value_counts()

Should we rescale the data? Scaling a dataset often helps models such as logistic regression and can lead to more accurate predictions. First we check the range (the min and the max) of each feature. Let's try the .describe() method; by default it summarizes only the numeric columns, so the Activity column, which is the last column, is excluded.

In [None]:
train.describe()  # we see that the min = -1 and the max = +1, so there is no need for scaling

In [None]:
train.dtypes.tail()

Both datasets have the same data types: mostly floats and one object feature. Let's see what the object feature is and extract it from the rest.

In [None]:
# Compare against the built-in object type; the np.object alias was removed in NumPy 1.24
object_feature = train.dtypes == object
object_feature = train.columns[object_feature]
object_feature

As we can see, the only object data type in both the train and the test datasets is the Activity feature. Let's take a closer look at it...

In [None]:
train.Activity.value_counts()

We will encode the Activity column as integers so that the target is a single numeric column rather than a column of strings. We will use LabelEncoder to encode the activities.
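As a small illustration of what LabelEncoder does, here is a minimal sketch on made-up label columns. The toy values and the names `toy_train`, `toy_test` and `encoder` are assumptions for this example and are not taken from the dataset. It fits the encoder once on the training labels and reuses it for the test labels, so both columns share one mapping.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy label columns standing in for train.Activity and test.Activity (hypothetical values)
toy_train = pd.Series(["WALKING", "SITTING", "LAYING", "SITTING"])
toy_test = pd.Series(["SITTING", "WALKING"])

encoder = LabelEncoder()
train_codes = encoder.fit_transform(toy_train)  # learn the label -> integer mapping
test_codes = encoder.transform(toy_test)        # reuse the same mapping

# classes_ holds the original labels in sorted order; the position is the integer code
print(list(encoder.classes_))   # ['LAYING', 'SITTING', 'WALKING']
print(train_codes.tolist())     # [2, 1, 0, 1]
print(test_codes.tolist())      # [1, 2]

# inverse_transform maps the integers back to the original strings
print(list(encoder.inverse_transform(test_codes)))  # ['SITTING', 'WALKING']
```

Keeping the fitted encoder around is useful later, because inverse_transform turns predicted integers back into readable activity names.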

In [None]:
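le = LabelEncoder()
# Fit on the training labels only, then reuse the same mapping for the test set
train['Activity'] = le.fit_transform(train.Activity)
test['Activity'] = le.transform(test.Activity)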

In [None]:
train.Activity.sample(5)

In [None]:
test.Activity.sample(5)

## 3 - Finding the Correlations/Relationships between the Features

Correlation refers to the mutual relationship and association between quantities, and it is generally used to express one quantity in terms of its relationship with other quantities. It can be positive (variables change in the same direction), negative (variables change in opposite directions) or neutral (no correlation).

Variables within a dataset can be related in lots of ways and for lots of reasons:
    - They could depend on the values of other variables
    - They could be associated with each other
    - They could both depend on a third variable.
    
In this project, we will use the pandas method .corr() (Pearson correlation by default) to calculate the correlation between dataframe columns.
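To make the three cases concrete, here is a minimal sketch on a made-up dataframe. The column names and values are assumptions chosen so that the result is easy to check by hand; they are not part of the dataset.

```python
import pandas as pd

# Toy columns (hypothetical): b rises with a, c falls with a, d is unrelated to a
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],   # b = 2 * a  -> positive correlation
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],    # c = 6 - a  -> negative correlation
    "d": [1.0, -1.0, 0.0, -1.0, 1.0],  # symmetric around the middle -> no linear correlation
})

# .corr() returns a square dataframe of pairwise Pearson correlations
corr = toy.corr()
print(corr.loc["a", ["b", "c", "d"]].round(3))
```

The row for `a` shows a correlation of 1 with `b`, -1 with `c` and 0 with `d`, matching the positive, negative and neutral cases described above.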

In [None]:
feature_cols = train.columns[: -1]   #exclude the Activity column
#Calculate the correlation values
correlated_values = train[feature_cols].corr()
#stack the data and convert to a dataframe

correlated_values = (correlated_values.stack().to_frame().reset_index()
                    .rename(columns={'level_0': 'Feature_1', 'level_1': 'Feature_2', 0:'Correlations'}))
correlated_values.head()

In [None]:
#create an abs_correlation column
correlated_values['abs_correlation'] = correlated_values.Correlations.abs()
correlated_values.head()

In [None]:
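#Picking the most correlated feature pairs (a feature paired with itself always has correlation 1, so skip those rows)
train_fields = (correlated_values.sort_values('Correlations', ascending = False)
                .query('abs_correlation > 0.8 and Feature_1 != Feature_2'))
train_fields.sample(5)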

## 4 - Splitting the data into train and validation 

In [None]:
#Getting the split indexes

split_data = StratifiedShuffleSplit(n_splits = 1, test_size = 0.3, random_state = 42)
train_idx, val_idx = next(split_data.split(train[feature_cols], train.Activity))

#creating the dataframes

x_train = train.loc[train_idx, feature_cols]
y_train = train.loc[train_idx, 'Activity']

x_val = train.loc[val_idx, feature_cols]
y_val = train.loc[val_idx, 'Activity']

In [None]:
y_train.value_counts(normalize = True)

In [None]:
y_val.value_counts(normalize = True)

In [None]:
#Same ratio of classes in both the train and validation data thanks to StratifiedShuffleSplit
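The effect of stratification can be checked on a small made-up example. This is only a sketch: the toy labels (70 of one class, 30 of another) and the names `X_toy`, `y_toy` and `splitter` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy imbalanced labels (hypothetical): 70 samples of class 0 and 30 of class 1
y_toy = pd.Series([0] * 70 + [1] * 30)
X_toy = np.zeros((len(y_toy), 1))  # the feature values do not matter for the split

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
toy_train_idx, toy_val_idx = next(splitter.split(X_toy, y_toy))

# The split returns positional indices, so use .iloc to select the rows
train_ratio = y_toy.iloc[toy_train_idx].mean()  # share of class 1 in the train part
val_ratio = y_toy.iloc[toy_val_idx].mean()      # share of class 1 in the validation part

print(len(toy_train_idx), len(toy_val_idx))
print(round(train_ratio, 2), round(val_ratio, 2))
```

Both parts keep roughly the original 30% share of class 1, which is the same property seen in the value_counts(normalize = True) outputs above.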

## 5 - Predictive Modelling

In [None]:
lr = LogisticRegression()
lr_l2 = LogisticRegressionCV(Cs=10, cv=4, penalty='l2')
rf = RandomForestClassifier(n_estimators = 10)

lr = lr.fit(x_train, y_train)

rf = rf.fit(x_train, y_train)

lr_l2 = lr_l2.fit(x_train, y_train)


In [None]:
#predict the classes and probability  for each

y_predict = list()
y_proba = list()

labels = ['lr', 'lr_l2', 'rf']
models = [lr, lr_l2, rf]

for lab, mod in zip(labels, models):
    y_predict.append(pd.Series(mod.predict(x_val), name = lab))
    y_proba.append(pd.Series(mod.predict_proba(x_val).max(axis=1), name = lab))
    # .max(axis=1) keeps the highest class probability for each row

y_predict = pd.concat(y_predict, axis = 1)
y_proba = pd.concat(y_proba, axis = 1)

y_predict.head()

In [None]:
y_proba.head(10)

## 6 - Calculating the Error Metrics

In [None]:
metrics = list()
confusion_m = dict()

for lab in labels:
    precision, recall, f_score, _ = error_metric(y_val, y_predict[lab], average = 'weighted')
    
    accuracy = accuracy_score(y_val, y_predict[lab])
    
    confusion_m[lab] = confusion_matrix(y_val, y_predict[lab])
    
    metrics.append(pd.Series({'Precision': precision, 'Recall': recall,
                            'F_score': f_score, 'Accuracy': accuracy}, name = lab))
    
metrics= pd.concat(metrics, axis =1) 

In [None]:
metrics

In [None]:
fig, axList = plt.subplots(nrows=2, ncols=2)
axList = axList.flatten()
fig.set_size_inches(12, 10)

axList[-1].axis('off')

for ax,lab in zip(axList[:-1], labels):
    sns.heatmap(confusion_m[lab], ax=ax, annot=True, fmt='d');
    ax.set(title=lab);
    
plt.tight_layout()

Observation: 

We can see that Logistic Regression with L2 regularization gives slightly better error metrics than the other models. In Part 2 of this project, we will look at the effect of correlation on the error metrics. The question we ask here is:

What happens when we discard the most correlated features? Do we get a better model or not?

We will discard the features whose variance is below the threshold 0.8 * (1 - 0.8) = 0.16, that is, features with low variance. We will use the sklearn feature_selection class VarianceThreshold. Note that this filters on variance rather than on correlation directly.
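A minimal sketch of how VarianceThreshold behaves, using a made-up dataframe whose column variances are easy to compute by hand. The column names and values are assumptions for illustration only. The expression 0.8 * (1 - 0.8) has the form p * (1 - p), the variance of a Bernoulli variable, and evaluates to 0.16.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy features (hypothetical) with known population variances
toy_features = pd.DataFrame({
    "constant": [0.5, 0.5, 0.5, 0.5],   # variance 0
    "narrow": [0.1, 0.2, 0.1, 0.2],     # variance 0.0025
    "wide": [-1.0, 1.0, -1.0, 1.0],     # variance 1.0
})

selector = VarianceThreshold(threshold=0.8 * (1 - 0.8))  # threshold = 0.16
reduced = selector.fit_transform(toy_features)           # NumPy array of kept columns

# get_support() returns a boolean mask over the input columns
kept = toy_features.columns[selector.get_support()].tolist()
print(selector.variances_.round(4))
print(kept)           # ['wide']
print(reduced.shape)  # (4, 1)
```

Only the column whose variance exceeds 0.16 survives. On features that all lie between -1 and +1, it is worth checking how many columns remain before fitting a model.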
        

## PART 2 - Feature Selection: Discarding the Most Correlated Features

In [None]:
#Remember...
train_fields.sample(5)
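For comparison, a filter that acts on correlation directly could look like the sketch below. It is not the approach used in this notebook. It keeps the first feature of every highly correlated pair and drops the other, on a made-up dataframe; the column names, the values and the 0.8 cutoff applied here are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy features (hypothetical): f2 is nearly a copy of f1, f3 is only weakly related
toy = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [1.1, 2.0, 3.2, 3.9, 5.1],
    "f3": [2.0, -1.0, 0.5, 3.0, -2.0],
})

abs_corr = toy.corr().abs()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
upper = abs_corr.where(np.triu(np.ones(abs_corr.shape, dtype=bool), k=1))

# Drop a column if it is highly correlated with any column that comes before it
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)                   # ['f2']
print(reduced.columns.tolist())  # ['f1', 'f3']
```

Which member of a correlated pair gets dropped depends on the column order, so this is one simple convention among several possible ones.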

In [None]:
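#Keep the features with high variance, then split the data into train and validation

low_var = VarianceThreshold(threshold=(0.8 * (1 - 0.8)))

train2 = pd.concat([x_train, x_val])
train_new = pd.DataFrame(low_var.fit_transform(train2), index=train2.index)

#Despite its name, test_new holds the labels, in the same row order as train2
test_new = pd.concat([y_train, y_val])

#Split features and labels in a single call so that the rows stay aligned
x_new, x_val_new, y_new, y_val_new = train_test_split(train_new, test_new, random_state=42)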

## Predictive Models

In [None]:
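#Fit fresh estimators so that the models trained on the full feature set are not overwritten
lr_new = LogisticRegression().fit(x_new, y_new)

lr_l2_new = LogisticRegressionCV(Cs=10, cv=4, penalty='l2').fit(x_new, y_new)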

In [None]:
#predict the classes and probability  for each

y_predict_new = list()
y_proba_new = list()

labels_new = ['lr_new', 'lr_l2_new']
models_new = [lr_new, lr_l2_new]

for lab, mod in zip(labels_new, models_new):
    y_predict_new.append(pd.Series(mod.predict(x_val_new), name = lab))
    y_proba_new.append(pd.Series(mod.predict_proba(x_val_new).max(axis=1), name = lab))
    # .max(axis=1) keeps the highest class probability for each row

y_predict_new = pd.concat(y_predict_new, axis = 1)
y_proba_new = pd.concat(y_proba_new, axis = 1)

y_predict_new.head()

In [None]:
y_proba_new.head()

## Calculating the error metrics

In [None]:
metrics_new = list()
con_mat = dict()

for lab in labels_new:
    precision, recall, f_score, _ = error_metric(y_val_new, y_predict_new[lab], average = 'weighted')
    
    accuracy = accuracy_score(y_val_new, y_predict_new[lab])
    
    con_mat[lab] = confusion_matrix(y_val_new, y_predict_new[lab])
    
    metrics_new.append(pd.Series({'precision': precision, 'recall': recall,
                            'f_score': f_score, 'accuracy': accuracy}, name = lab))
    
metrics_new= pd.concat(metrics_new, axis =1) 
