# Train dataset

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# reading the dataset
train = pd.read_csv('../input/human-activity-recognition-with-smartphones/train.csv')
train.head(10)

In [None]:
# shape of the dataset
train.shape

# Test dataset

In [None]:
test = pd.read_csv('../input/human-activity-recognition-with-smartphones/test.csv')
test.head(10)

In [None]:
#shape of test dataset
test.shape

We will not use the test dataset when predicting the results of our model.

# Data Cleaning

## 1. Check for duplicates

In [None]:
# checking for duplicate rows in the train and test datasets - if there are any, we will remove them
print("Number of duplicate rows in train dataset:", train.duplicated().sum())
print("Number of duplicate rows in test dataset:", test.duplicated().sum())

As we can see, there are no duplicate rows in the train or test dataset.
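Had any duplicates been found, they could be removed with `drop_duplicates()`. A minimal sketch on a made-up toy frame (the frame `toy` and its columns are illustrative, not part of the dataset):

```python
import pandas as pd

# Toy frame standing in for train; the third row repeats the first one.
toy = pd.DataFrame({"a": [1, 2, 1], "b": [0.5, 0.7, 0.5]})

# duplicated() flags rows that are identical to an earlier row.
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print("duplicates found:", n_duplicates)
print(deduplicated)
```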

## 2. Check for null values

In [None]:
# checking for null values - if there are any, we will fill them using SimpleImputer with the mean or median strategy
print("Number of null values in train dataset:", train.isna().sum().sum())
print("Number of null values in test dataset:", test.isna().sum().sum())

As we can see, there are no null values in the train or test dataset.
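For reference, this is roughly how the imputation mentioned in the cell comment could look if null values were present. It is only a sketch on made-up data, using the median strategy; the frame `toy` is hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frame with one missing value per column (illustrative data only).
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})

# strategy can be "mean" or "median"; each column is filled with its own statistic.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```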

## 3. Check for outliers

### Train dataset

In [None]:
# checking for outliers - if there are any, we will remove them or replace them with the median

# 1. Identify the outliers with an IQR-style range
# (note: this uses the 10th and 90th percentiles rather than the usual 25th and 75th)
q1 = train.quantile(0.10, numeric_only=True)
q3 = train.quantile(0.90, numeric_only=True)
iqr = q3 - q1
# printing the range for each column, which will help us detect outliers
print(iqr)

In [None]:
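# 2. printing the number of outliers
# compare only the numeric columns that q1, q3 and iqr were computed for
numeric_train = train[iqr.index]
outlier = ((numeric_train < (q1 - 1.5*iqr)) | (numeric_train > (q3 + 1.5*iqr))).values.sum()
print("Number of outliers:", outlier)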

*Ignore any error or warning from this cell for now.*

We can see that there are some outliers. We would adjust them before predicting results, which we won't do in this notebook.

### Test dataset

In [None]:
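# 1. Identify the outliers with an IQR-style range
# (note: this uses the 10th and 90th percentiles rather than the usual 25th and 75th)
q1 = test.quantile(0.10, numeric_only=True)
q3 = test.quantile(0.90, numeric_only=True)
iqr = q3 - q1
# printing the range for each column, which will help us detect outliers
print(iqr)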

In [None]:
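# 2. printing the number of outliers
# compare only the numeric columns that q1, q3 and iqr were computed for
numeric_test = test[iqr.index]
outlier = ((numeric_test < (q1 - 1.5*iqr)) | (numeric_test > (q3 + 1.5*iqr))).values.sum()
print("Number of outliers:", outlier)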

*Ignore any error or warning from this cell for now.*

We can see that there are some outliers. We would adjust them before predicting results, which we won't do in this notebook.
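One possible way to adjust outliers later is to replace them with the column median, as the cell comments suggest. The sketch below is only an illustration on made-up data. It assumes the conventional 25th and 75th percentiles for the IQR rather than the 10th and 90th used above, and the names `toy`, `is_outlier` and `cleaned` are hypothetical:

```python
import pandas as pd

# Toy numeric frame; 100.0 in column "a" is an obvious outlier (illustrative data).
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 100.0],
                    "b": [5.0, 6.0, 7.0, 8.0, 9.0]})

# Assumption: conventional quartiles for the IQR fences.
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean frame marking values outside the fences of their own column.
is_outlier = (toy < lower) | (toy > upper)

# Replace each flagged value with the median of its column.
cleaned = toy.mask(is_outlier, toy.median(), axis=1)

print("outliers replaced:", int(is_outlier.values.sum()))
print(cleaned)
```

Removing the affected rows instead would be `toy[~is_outlier.any(axis=1)]`; which option is better depends on how much data can be spared.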

# Exploratory data analysis - EDA

## 1. Description of dataset

In [None]:
#description of train dataset
train.describe()

As we already know, there are 7352 rows and 562 columns. With the `describe()` method you can see the mean, std, min, max, etc. for each column.

In [None]:
#description of test dataset
test.describe()

As we already know, there are 2947 rows and 562 columns. With the `describe()` method you can see the mean, std, min, max, etc. for each column.

## 2. Visualize user data

In [None]:
# visualizing the user data in train
plt.figure(figsize=(16,8))
sns.countplot(x='subject', hue='Activity', data=train)
plt.title("User data", fontsize=20)
plt.show()

## 3. Activity and subject data points

In [None]:
sns.countplot(x='Activity', data=train)
plt.title("Activity data points", fontsize=20)
plt.xticks(rotation=90)
plt.show()

**Walking downstairs** has the fewest data points, so we can say it was the **least performed** activity. **Laying** has the most, at about 1400 data points, so it was the **most performed.**
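The exact numbers behind a count plot can be read with `value_counts()`. A small sketch on made-up labels (the frame `toy` and its label values are illustrative only; on the real data the same calls would be applied to `train['Activity']`):

```python
import pandas as pd

# Toy label column standing in for train['Activity'] (illustrative values).
toy = pd.DataFrame({"Activity": ["LAYING", "LAYING", "LAYING",
                                 "WALKING", "WALKING",
                                 "WALKING_DOWNSTAIRS"]})

# Number of rows per activity, sorted from most to least frequent.
counts = toy["Activity"].value_counts()

most_performed = counts.idxmax()
least_performed = counts.idxmin()

print(counts)
print("most performed:", most_performed)
print("least performed:", least_performed)
```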

In [None]:
sns.countplot(x='subject', data=train)
plt.title("Subject data points", fontsize=20)
plt.xticks(rotation=90)
plt.show()

From the data points above, we can say that most subjects performed roughly the same amount of movement (activity), though a few subjects did more than others.

There is no need to check this for the test dataset.

In [None]:
test.Activity.unique()

## 4. Dropping some columns

Now we will drop the **subject** column because it only identifies the participant and is not a sensor feature, and we will drop the **Activity** column because it is the **dependent column.**

In [None]:
train_PCA = train.drop(['Activity', 'subject'], axis=1)

# PCA

## 1. PCA projection to 2D

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(train_PCA)

We imported PCA from sklearn.decomposition and set n_components = 2 because we want to project our dataset onto two dimensions.
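To see what `fit_transform` returns, here is a minimal sketch on random data. The array `features` is a made-up stand-in for `train_PCA`, and its size is arbitrary:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the feature matrix: 100 samples, 5 features (illustrative).
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 5))

pca_demo = PCA(n_components=2)
projected = pca_demo.fit_transform(features)

# One row per sample, one column per principal component.
print(projected.shape)
# Share of the total variance captured by each kept component.
print(pca_demo.explained_variance_ratio_)
```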

## 2. Storing 2D values in new dataframe

In [None]:
humanActivity = pd.DataFrame({'x':principalComponents[:,0], 'y':principalComponents[:,1] ,'label':train['Activity']})
humanActivity

We stored the principal components in the humanActivity dataframe together with the activity labels. The labels will help us visualize the 2D dataset.

## 3. Visualize 2D Projection

In [None]:
sns.lmplot(data=humanActivity, x='x', y='y', hue='label', fit_reg=False, height=8, markers=['^','v','s','o', '1','2'])
plt.show()

* We can't separate our activities.

* A model will probably be confused about where to separate the dataset.

## 4. Variance

In [None]:
pca.explained_variance_ratio_

Using the **explained_variance_ratio_** attribute, you can see that the first principal component contains 62.55% of the variance and the second principal component contains 4.91% of the variance. Together, the two components contain 67.46% of the information.
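The combined share of the two components is simply the sum of the two ratios. A short check using the percentages quoted in the text (the array `ratios` is typed in by hand here, not read from the fitted `pca` object):

```python
import numpy as np

# Ratios quoted in the text for the first and second principal components.
ratios = np.array([0.6255, 0.0491])

# Running total of explained variance as components are added.
cumulative = np.cumsum(ratios)
total_percent = round(float(cumulative[-1]) * 100, 2)

print("variance explained by both components (%):", total_percent)
```

On the fitted object, the same total would come from `pca.explained_variance_ratio_.sum()`.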