### CONTEXT


DATASET OBTAINED FROM KAGGLE: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones

"The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING)."

__IMPLEMENTED SUPERVISED MACHINE LEARNING ALGORITHMS FOR CLASSIFICATION__

### IMPORTING REQUIRED LIBRARIES

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.model_selection import cross_val_score

# To remove the scientific notation from numpy arrays
np.set_printoptions(suppress=True)

### READING THE DATA

In [None]:
# Training data; test.csv is kept aside for the final evaluation in the deployment section
df= pd.read_csv("../input/human-activity-recognition-with-smartphones/train.csv")

In [None]:
df.head()

### EXPLORATORY DATA ANALYSIS

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df= df.drop_duplicates()
df.shape

__No duplicate rows were found.__

In [None]:
df.isnull().sum()[df.isnull().sum()>0]

__The data has no missing values in the form of NaN.__

In [None]:
from sklearn.decomposition import PCA

# 'subject' is a volunteer ID, not a sensor feature, so it is excluded along with the target
X= df.drop(columns=['Activity','subject']).values

pca = PCA(n_components=3)

# Fitting the data
pca_fit=pca.fit(X)

# Calculating the principal components
reduced_X = pca_fit.transform(X)
# The 561 columns in X are now represented by the 3 principal components in reduced_X

**Since there are 561 predictors, we use PCA to reduce them to three principal components, which makes visualization feasible.**
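
How much information three components keep can be checked through the explained variance ratio. The sketch below is illustrative only: it uses synthetic data (`X_demo` is an assumption, not the HAR features) and computes PCA through NumPy's SVD rather than scikit-learn. With a fitted scikit-learn `PCA` object, the same quantity is available as `explained_variance_ratio_`.

```python
import numpy as np

# Synthetic stand-in for a wide feature matrix (hypothetical data, not the HAR set)
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))       # 3 underlying factors
mixing = rng.normal(size=(3, 20))        # spread across 20 observed columns
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 20))

# PCA by SVD: centre the columns, then decompose
X_centered = X_demo - X_demo.mean(axis=0)
_, singular_values, _ = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance carried by each principal component
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)
retained = explained_variance_ratio[:3].sum()
print("Variance retained by the first 3 components:", round(float(retained), 3))
```

A low retained share would mean that plots of PC1 to PC3 hide much of the structure in the original columns.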

In [None]:
df2= pd.DataFrame(reduced_X, columns=['PC1','PC2','PC3'])
# .values avoids index misalignment if drop_duplicates() removed rows from df
df2['activity']=df['Activity'].values
df2.head()

### VISUALIZING THE DISTRIBUTION OF THE COLUMNS

_Since PC1, PC2 and PC3 are continuous in nature, we will use histograms to visualize them._

_For Activity, we will use bar chart because it is categorical in nature._

In [None]:
df2.hist(['PC1','PC2','PC3'],figsize=(20,5))

_None of the components shows extreme skewness; each has a reasonably well-spread distribution._

In [None]:
def bar_graph(data,predictor):
    grouped=data.groupby(predictor)
    chart=grouped.size().plot.bar(rot=0, title='Bar Chart showing the total frequency of different '+str(predictor), figsize=(15,4))
    chart.set_xlabel(predictor)

In [None]:
bar_graph(df2,'activity')

In [None]:
df2.activity.value_counts()

__The distribution of the classes is fairly balanced.__
_____________________________________________________________________________________________________________________________

### VISUALIZING THE RELATIONSHIP BETWEEN THE PREDICTORS AND THE TARGET VARIABLE

_Using boxplot to see the relationship between categorical target variable and continuous predictors._

In [None]:
df2.boxplot(column=['PC1'], by='activity', figsize=(15,10),grid=False, layout=(2,1))

In [None]:
df2.boxplot(column=['PC2'], by='activity', figsize=(15,5),grid=False)

In [None]:
df2.boxplot(column=['PC3'], by='activity', figsize=(15,5),grid=False)

__The median value differs across activities in all 3 boxplots. This suggests that the predictors are related to the target variable.__

### STATISTICAL TEST FOR CORRELATION

In [None]:
def anova_test(data,target,predictor):
    data1=data.groupby(target)[predictor].apply(list)
    from scipy.stats import f_oneway
    AnovaResults = f_oneway(*data1)
    if AnovaResults[1]<0.05:
        print(str(predictor)+' is related with the target variable : ', AnovaResults[1])
    else:
        print(str(predictor)+' is NOT related with the target variable : ', AnovaResults[1])

In [None]:
anova_test(df2,'activity','PC1')

In [None]:
anova_test(df2,'activity','PC2')

In [None]:
anova_test(df2,'activity','PC3')

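__We used a one-way ANOVA test to check whether the mean of each predictor differs across activities; a p-value below 0.05 is treated as evidence that the predictor is related to the target variable.__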
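
To make the test statistic concrete, here is a minimal pure-Python sketch of the one-way ANOVA F statistic, which compares the variance between group means with the variance inside the groups. The function name `one_way_anova_f` and the toy groups are assumptions for illustration; the notebook itself relies on `scipy.stats.f_oneway`, which also returns the p-value.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k = len(groups)          # number of groups
    n = len(all_values)      # total number of observations

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Toy data: three groups with clearly different means
demo_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(demo_groups)
print("F statistic:", f_stat)   # approximately 13
```

A large F value means the group means are far apart relative to the spread within each group, which is what leads to a small p-value.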

### TREATING THE CATEGORICAL VARIABLE

In [None]:
df2.activity.unique()

In [None]:
activity_mapping = {'STANDING': 1,
                    'SITTING': 2,
                    'LAYING': 3,
                    'WALKING': 4,
                    'WALKING_DOWNSTAIRS': 5,
                    'WALKING_UPSTAIRS': 6}
# Encoding the Activity labels as integer codes (the order carries no meaning)
df['Activity'] = df['Activity'].map(activity_mapping)

# Checking the encoded columns
df['Activity'].unique()

__USING PCA WE SAW THAT THE PREDICTORS ARE RELATED TO THE TARGET VARIABLE. HOWEVER, WE WILL NOT USE THE PCA COLUMNS FOR MODELLING, BECAUSE KEEPING ONLY THREE COMPONENTS DISCARDS INFORMATION AND CAN REDUCE THE ACCURACY.__

### SPLITTING THE DATASET INTO TRAINING AND TESTING

In [None]:
df.head()

In [None]:
TargetVariable='Activity'
df2=df.drop(columns=['Activity','subject'])
predictor = df2.columns
x=df[predictor].values
y =df[TargetVariable].values

from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
x=scaler.fit_transform(x)

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
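
The cells above fit the scaler on all rows before splitting, so the held-out rows contribute to the scaling statistics. An alternative order, sketched below under assumptions (synthetic `X_demo`, a manual 70/30 split, plain NumPy instead of `StandardScaler`), is to split first and learn the mean and standard deviation from the training rows only. In scikit-learn terms this corresponds to calling `fit_transform` on the training part and `transform` on the test part.

```python
import numpy as np

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 4))   # hypothetical feature matrix

# Split first (70/30), so the test rows stay unseen
indices = rng.permutation(len(X_demo))
train_idx, test_idx = indices[:70], indices[70:]
X_train_demo, X_test_demo = X_demo[train_idx], X_demo[test_idx]

# Learn the scaling statistics from the training rows only
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)

# Apply the same transformation to both parts
X_train_scaled = (X_train_demo - train_mean) / train_std
X_test_scaled = (X_test_demo - train_mean) / train_std

print(X_train_scaled.shape, X_test_scaled.shape)
```

The test columns will not have exactly zero mean and unit variance after this step, which is expected: they are transformed with the training statistics.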

### APPLYING DIFFERENT ALGORITHMS

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

### MODEL

_LOGISTIC REGRESSION_

In [None]:
clf = LogisticRegression(C=1)

# Creating the model on Training Data
LOG=clf.fit(x_train,y_train)
prediction=LOG.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))
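
`classification_report` ends with a weighted-average row, and the F1 entry of that row is the figure printed at the end of the cell above. Weighted F1 and plain accuracy are related but not identical. The pure-Python sketch below shows the difference on a toy example; the helper names `accuracy` and `weighted_f1` and the demo labels are assumptions for illustration, and in practice `metrics.accuracy_score` and `metrics.f1_score(..., average='weighted')` do the same job.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)

# Toy labels: one sample of class 1 is misclassified as class 2
y_true_demo = [1, 1, 1, 1, 2, 2]
y_pred_demo = [1, 1, 1, 2, 2, 2]

print("Accuracy   :", round(accuracy(y_true_demo, y_pred_demo), 4))
print("Weighted F1:", round(weighted_f1(y_true_demo, y_pred_demo), 4))
```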

_K-NEAREST NEIGHBORS CLASSIFIER_

In [None]:
clf = KNeighborsClassifier(n_neighbors=3)

# Creating the model on Training Data
KNN=clf.fit(x_train,y_train)
prediction=KNN.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))


_DECISION TREE CLASSIFIER_

In [None]:
clf = DecisionTreeClassifier(max_depth=3,criterion='entropy')

# Creating the model on Training Data
DTree=clf.fit(x_train,y_train)
prediction=DTree.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

# Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(DTree.feature_importances_, index=predictor)
feature_importances.nlargest(10).plot(kind='barh')

_RANDOM FOREST CLASSIFIER_

In [None]:
clf = RandomForestClassifier(max_depth=4, n_estimators=600,criterion='entropy')

# Creating the model on Training Data
RF=clf.fit(x_train,y_train)
prediction=RF.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

# Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(RF.feature_importances_, index=predictor)
feature_importances.nlargest(10).plot(kind='barh')

_SUPPORT VECTOR MACHINE_

In [None]:
clf = SVC(C=100, gamma=0.001, kernel='rbf')

# Creating the model on Training Data
SVM=clf.fit(x_train,y_train)
prediction=SVM.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

### SAMPLING TECHNIQUES: SMOTE, OVERSAMPLING, UNDERSAMPLING

In [None]:
from imblearn.over_sampling import SMOTE
smk=SMOTE(random_state=42)
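# fit_resample replaces fit_sample, which is no longer available in recent imbalanced-learn releases
x_smote,y_smote=smk.fit_resample(x_train,y_train)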
print('Resampled dataset shape %s' % Counter(y_smote))

In [None]:
from imblearn.over_sampling import RandomOverSampler
ros= RandomOverSampler(random_state=42)
x_over,y_over= ros.fit_resample(x_train,y_train)
print('Resampled dataset shape %s' % Counter(y_over))

In [None]:
from imblearn.under_sampling import RandomUnderSampler
rus= RandomUnderSampler(random_state=42)
x_under,y_under= rus.fit_resample(x_train,y_train)
print('Resampled dataset shape %s' % Counter(y_under))
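
As a rough picture of what random oversampling does, the sketch below duplicates randomly chosen rows of each smaller class until every class matches the largest one. The function `random_oversample` and the toy data are hypothetical and only illustrate the idea; they are not the imbalanced-learn implementation. SMOTE differs in that it creates new synthetic points between neighbouring minority samples instead of copying rows, while undersampling drops rows from the larger classes.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate random rows of the smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Rows belonging to this class in the original data
        pool = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

# Toy data: class 1 has 3 rows, class 2 has 2 rows, class 3 has 1 row
X_demo = [[0.1], [0.2], [0.3], [0.9], [1.0], [2.0]]
y_demo = [1, 1, 1, 2, 2, 3]

X_bal, y_bal = random_oversample(X_demo, y_demo)
print('Resampled dataset shape %s' % Counter(y_bal))
```

Resampling is applied to the training split only, as in the cells above, so that the test rows keep their original class distribution.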

_LOGISTIC REGRESSION AFTER SAMPLING_

In [None]:
clf = LogisticRegression(C=1)

# Creating the model on Training Data
LOG=clf.fit(x_smote,y_smote)
prediction=LOG.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

In [None]:
clf = LogisticRegression(C=1)

# Creating the model on Training Data
LOG=clf.fit(x_over,y_over)
prediction=LOG.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

In [None]:
clf = LogisticRegression(C=1)

# Creating the model on Training Data
LOG=clf.fit(x_under,y_under)
prediction=LOG.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

_K-NEAREST NEIGHBORS CLASSIFIER AFTER SAMPLING_

In [None]:
clf = KNeighborsClassifier(n_neighbors=3)

# Creating the model on Training Data
KNN=clf.fit(x_smote,y_smote)
prediction=KNN.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

In [None]:
clf = KNeighborsClassifier(n_neighbors=3)

# Creating the model on Training Data
KNN=clf.fit(x_over,y_over)
prediction=KNN.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))


In [None]:
clf = KNeighborsClassifier(n_neighbors=3)

# Creating the model on Training Data
KNN=clf.fit(x_under,y_under)
prediction=KNN.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))


_DECISION TREE AFTER SAMPLING_

In [None]:
clf = DecisionTreeClassifier(max_depth=3,criterion='entropy')

# Creating the model on Training Data
DTree=clf.fit(x_smote,y_smote)
prediction=DTree.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

# Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(DTree.feature_importances_, index=predictor)
feature_importances.nlargest(10).plot(kind='barh')

In [None]:
clf = DecisionTreeClassifier(max_depth=3,criterion='entropy')

# Creating the model on Training Data
DTree=clf.fit(x_over,y_over)
prediction=DTree.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

# Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(DTree.feature_importances_, index=predictor)
feature_importances.nlargest(10).plot(kind='barh')

In [None]:
clf = DecisionTreeClassifier(max_depth=3,criterion='entropy')

# Creating the model on Training Data
DTree=clf.fit(x_under,y_under)
prediction=DTree.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

# Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(DTree.feature_importances_, index=predictor)
feature_importances.nlargest(10).plot(kind='barh')

_RANDOM FOREST CLASSIFIER AFTER SAMPLING_

In [None]:
clf = RandomForestClassifier(max_depth=4, n_estimators=600,criterion='entropy')

# Creating the model on Training Data
RF=clf.fit(x_smote,y_smote)
prediction=RF.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

# Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(RF.feature_importances_, index=predictor)
feature_importances.nlargest(10).plot(kind='barh')

In [None]:
clf = RandomForestClassifier(max_depth=4, n_estimators=600,criterion='entropy')

# Creating the model on Training Data
RF=clf.fit(x_over,y_over)
prediction=RF.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

# Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(RF.feature_importances_, index=predictor)
feature_importances.nlargest(10).plot(kind='barh')

In [None]:
clf = RandomForestClassifier(max_depth=4, n_estimators=600,criterion='entropy')

# Creating the model on Training Data
RF=clf.fit(x_under,y_under)
prediction=RF.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

# Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(RF.feature_importances_, index=predictor)
feature_importances.nlargest(10).plot(kind='barh')

_SUPPORT VECTOR CLASSIFIER AFTER SAMPLING_

In [None]:
clf = SVC(C=100, gamma=0.001, kernel='rbf')

# Creating the model on Training Data
SVM_smote=clf.fit(x_smote,y_smote)
prediction=SVM_smote.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

In [None]:
clf = SVC(C=100, gamma=0.001, kernel='rbf')

# Creating the model on Training Data
SVM_over=clf.fit(x_over,y_over)
prediction=SVM_over.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

In [None]:
clf = SVC(C=100, gamma=0.001, kernel='rbf')

# Creating the model on Training Data
SVM_under =clf.fit(x_under,y_under)
prediction=SVM_under.predict(x_test)

# Measuring accuracy on Testing Data
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

__THE BEST MODEL IS SUPPORT VECTOR CLASSIFIER WITHOUT ANY SAMPLING TECHNIQUE.__

_Accuracy: 99%_

_Error proportion: 0.013_

### K-FOLD CROSS VALIDATION

In [None]:
# cross_val_score clones the estimator, so only its hyperparameters are reused; SVM is the model chosen above
f1_values= cross_val_score(SVM, x, y, cv=10, scoring='f1_weighted')
print(f1_values)
print('Final Average Weighted F1 Score of the Model:',f1_values.mean())
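
The idea behind k-fold cross validation is that every row is used for testing exactly once and for training in the remaining folds. The sketch below shows only the index bookkeeping, with a hypothetical helper `k_fold_indices` and no shuffling. It is a simplification: for a classifier with an integer `cv`, scikit-learn uses a stratified split that also keeps the class proportions similar in every fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive, non-overlapping folds."""
    # The first n_samples % k folds get one extra row
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Toy example: 10 rows split into 3 folds
folds = list(k_fold_indices(10, 3))
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print('Fold', fold_number, 'train:', train_idx, 'test:', test_idx)
```

The model is refitted on each training part and scored on the matching test part, and the per-fold scores are then averaged.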

### DEPLOYMENT OF THE MODEL

In [None]:
final_svm= SVM.fit(x,y)

In [None]:
test= pd.read_csv('../input/human-activity-recognition-with-smartphones/test.csv')

In [None]:
test.drop(columns=['subject'],inplace=True)
test=test.drop_duplicates()

In [None]:
test.isnull().sum()[test.isnull().sum()>0]

In [None]:
activity_mapping = {'STANDING': 1,
                    'SITTING': 2,
                    'LAYING': 3,
                    'WALKING': 4,
                    'WALKING_DOWNSTAIRS': 5,
                    'WALKING_UPSTAIRS': 6}
# Encoding the Activity labels with the same integer codes used for training
test['Activity'] = test['Activity'].map(activity_mapping)

# Checking the encoded columns
test['Activity'].unique()

In [None]:
test.head()

In [None]:
TargetVariable='Activity'
test2= test.drop(columns=['Activity'])
predictor = test2.columns
x_test= test[predictor].values
y_test = test[TargetVariable].values

# Reusing the scaler fitted on the training data; refitting it on the test set would apply a different transformation
x_test_scaled=scaler.transform(x_test)

prediction= final_svm.predict(x_test_scaled)
test['Activity_Predictions']=prediction
test.head()
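
The predictions are stored as integer codes. To read them as activity names again, the mapping can be inverted, as in this small sketch. The names `activity_mapping_demo`, `inverse_mapping` and the example codes are assumptions for illustration only.

```python
# Same label-to-code mapping as used for encoding
activity_mapping_demo = {'STANDING': 1,
                         'SITTING': 2,
                         'LAYING': 3,
                         'WALKING': 4,
                         'WALKING_DOWNSTAIRS': 5,
                         'WALKING_UPSTAIRS': 6}

# Invert it so that integer codes map back to activity names
inverse_mapping = {code: name for name, code in activity_mapping_demo.items()}

# Hypothetical predicted codes, standing in for the model output
predicted_codes = [3, 4, 6, 1]
predicted_names = [inverse_mapping[code] for code in predicted_codes]
print(predicted_names)
```

On a pandas column the same lookup can be done with `Series.map(inverse_mapping)`.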

In [None]:
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
print(metrics.confusion_matrix(y_test, prediction))

# Printing the weighted-average F1 score of the model
F1_Score=metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model:', round(F1_Score, 2))

#### ACCURACY OF TRAIN SET : 99%

#### ACCURACY OF TEST SET : 100%

_MODEL: SUPPORT VECTOR CLASSIFIER WITHOUT SAMPLING (because it has the highest accuracy and lowest error percentage)_

_STANDARDIZED the TRAIN SET_
