# Please upvote if you liked my work

My [Linkedin](https://www.linkedin.com/in/letian-dai-phd-physics-nanomaterial-nanoscience-nanotechnology-datascience-bigdata/) <br>
My [Git](https://github.com/daiwofei)

Sleep Cycle (SC; Northcube, Gothenburg, Sweden) is a mobile-phone application that is available on Android and iOS devices. SC is a smart alarm clock that tracks your sleep patterns and wakes you up during light sleep. It tracks sleep throughout the night and uses a 30-minute window ending at the desired alarm time, during which the alarm goes off at the lightest possible sleep stage (i.e., light sleep). SC scores sleep through motion detection via one of two modes: (i) microphone, which uses the built-in microphone to analyze movements, or (ii) accelerometer, which uses the phone's built-in accelerometer. SC tracks movements through the night and uses them to detect and score sleep as well as to plot a graph (hypnogram).

With the microphone option, the SC application uses sound analysis to identify sleep phases by tracking movements in bed. It picks up sounds from the sleeper with the smartphone's built-in microphone, then filters the sound using a series of high and low cut-off filters to identify specific noises that correlate with movement. When there is no motion, the application registers deep sleep; when there is little motion, it registers light sleep; when there is a lot of motion, it registers wakefulness. The algorithm SC uses for sleep scoring is not available to the public.

Through this method, it is possible to extract 30-s-epoch information about sleep scoring (wake/light sleep/deep sleep), which was also used for measuring sleep parameters. Time in bed (TiB) was calculated based on the "went to sleep" and waking times reported by the application. Total sleep time (TST) was calculated by subtracting all the "awake" epochs from the TiB. Sleep onset latency (SOL) was calculated by summing up all the awake 30-s epochs before the first light-sleep epoch. Wake after sleep onset (WASO) was calculated by summing up all the awake epochs that lie between the first light-sleep epoch and the "woke up" time. Sleep efficiency is SE = (TST/TiB) x 100. ([source from site](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6806072/))
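The sleep parameters defined above can be sketched in a few lines of Python. This is only an illustration under assumptions: the epoch labels `'wake'`, `'light'` and `'deep'`, the function name `sleep_parameters` and the toy night are hypothetical, because the CSV used in this notebook is summarized per night and contains no per-epoch data.

```python
EPOCH_SECONDS = 30  # each epoch lasts 30 s, as described above


def sleep_parameters(epochs):
    """Compute TiB, TST, SOL, WASO (in seconds) and SE (in %) from 30-s epoch labels.

    `epochs` is a list of hypothetical labels 'wake', 'light' or 'deep',
    covering the night from "went to sleep" to "woke up".
    """
    tib = len(epochs) * EPOCH_SECONDS
    # Index of the first light-sleep epoch (end of the list if there is none)
    first_light = epochs.index('light') if 'light' in epochs else len(epochs)
    # SOL: awake epochs before the first light-sleep epoch
    sol = epochs[:first_light].count('wake') * EPOCH_SECONDS
    # WASO: awake epochs between the first light-sleep epoch and waking up
    waso = epochs[first_light:].count('wake') * EPOCH_SECONDS
    # TST: time in bed minus all awake epochs
    tst = tib - epochs.count('wake') * EPOCH_SECONDS
    # SE: share of the time in bed that was spent asleep
    se = 100.0 * tst / tib if tib else float('nan')
    return {'TiB': tib, 'TST': tst, 'SOL': sol, 'WASO': waso, 'SE': se}


# Toy night: 2 min awake, 5 min light, 10 min deep, 1 min awake, 2 min light
night = ['wake'] * 4 + ['light'] * 10 + ['deep'] * 20 + ['wake'] * 2 + ['light'] * 4
params = sleep_parameters(night)
print(params)
```

For this toy night, the 20 minutes in bed contain 3 minutes of wakefulness, so the sleep efficiency is 85 %.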

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 1 - Input the data

In [None]:
# This CSV uses ";" as the delimiter
df = pd.read_csv('/kaggle/input/sleep-data/sleepdata.csv',delimiter=";")
df

We notice that there is a lot of missing data (NaN) in the CSV file. We need to find out what percentage of the data is missing.

In [None]:
df.info()
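`df.info()` reports non-null counts, and the percentage of missing values per column can be derived from the same information. A minimal sketch on a toy frame: the helper name `missing_percentage` and the values are made up for illustration, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd


def missing_percentage(frame):
    """Return the percentage of NaN values per column, largest first."""
    return (frame.isnull().mean() * 100).sort_values(ascending=False)


# Toy data with the same kind of gaps as the real CSV (values are invented)
toy = pd.DataFrame({
    'Sleep quality': ['100%', '3%', '98%', '65%'],
    'Wake up': [':)', np.nan, ':|', np.nan],
    'Heart rate': [np.nan, np.nan, 59.0, np.nan],
})
missing = missing_percentage(toy)
print(missing)
```

In this notebook the same helper could be called as `missing_percentage(df)`.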

In [None]:
import seaborn as sns
#check the null part in the whole data set, red part is missing data
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='coolwarm')

Data are missing in the features "Wake up", "Sleep Notes" and "Heart rate". We can also see that the feature "Time in bed" is computed from the features "Start" and "End". The quality of sleep depends not only on the duration of sleep ("Time in bed") but also on the moment you go to sleep ("Start"). <br>

The next step is to convert "Start" and "End" to timestamps. 

In [None]:
import time
import datetime

df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])

# The feature "Time in bed" is computed as df['End'] - df['Start']. We can convert it to seconds.

In [None]:
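# Store the duration as a float number of seconds. .dt.total_seconds() behaves the same
# across pandas versions, while astype('timedelta64[s]') keeps a timedelta dtype on pandas 2.x.
df['Time in bed'] = (df['End'] - df['Start']).dt.total_seconds()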

# The sleep quality needs to be converted from *string* to *integer* type (the trailing character is stripped first)

In [None]:
df['Sleep quality'] = df['Sleep quality'].apply(lambda x: np.nan if x in ['-'] else x[:-1]).astype(int)

I am interested in the moment of going to bed. For example, "2014-12-29 22:57:49" has the time "22:57:49", which is second ***82669*** of the day (***95.68 %*** of a day).

In [None]:
df['Start time'] = pd.Series([val.time() for val in df['Start']])
df['End time'] = pd.Series([val.time() for val in df['End']])

In [None]:
df['Start time in second'] = df['Start time'].apply(lambda x: (x.hour*60+x.minute)*60 + x.second)
df['End time in second'] = df['End time'].apply(lambda x: (x.hour*60+x.minute)*60 + x.second)
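The worked example above (22:57:49) can be checked with the same formula used in the cell. A minimal sketch, where the helper name `seconds_since_midnight` is introduced only for illustration:

```python
from datetime import datetime

SECONDS_PER_DAY = 24 * 60 * 60


def seconds_since_midnight(ts):
    """Convert the time-of-day part of a timestamp to seconds since 00:00:00."""
    return (ts.hour * 60 + ts.minute) * 60 + ts.second


start = datetime(2014, 12, 29, 22, 57, 49)
seconds = seconds_since_midnight(start)
percent_of_day = 100 * seconds / SECONDS_PER_DAY
print(seconds, round(percent_of_day, 2))
```

One caveat of this encoding: a bedtime shortly after midnight maps to a small value while one shortly before midnight maps to a large value, so two close bedtimes can look far apart.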

We can try to find the correlations between the non-null features.

In [None]:
import matplotlib.pyplot as plt
# visualisation of this correlation
fig = plt.figure(figsize = (12,10))
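# numeric_only=True (pandas >= 1.5) skips text columns such as "Sleep Notes"; pandas 2.x raises an error without it
r = sns.heatmap(df.corr(numeric_only=True),cmap='Oranges')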
# set title
r.set_title('Correlation')

# Let's check the correlations of features to the "sleep quality"

In [None]:
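# numeric_only=True (pandas >= 1.5) skips the text columns; pandas 2.x raises an error without it
df.corr(numeric_only=True)['Sleep quality'].sort_values(ascending = False)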

# It is clear that "Time in bed" is the feature most correlated with "Sleep quality" (apart from "Sleep quality" itself). The "Start time" is mostly related to the "End time". 
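Sorting the correlation column in descending order places strong negative correlations at the bottom and keeps the target itself at the top. A minimal sketch that ranks features by absolute correlation and drops the target; the helper name `rank_by_correlation` and the toy values are hypothetical, and `numeric_only` assumes pandas 1.5 or newer.

```python
import pandas as pd


def rank_by_correlation(frame, target):
    """Rank numeric features by the absolute value of their correlation with `target`."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy data: 'a' is perfectly correlated, 'b' strongly anti-correlated, 'c' less so
toy = pd.DataFrame({
    'target': [1, 2, 3, 4, 5],
    'a': [2, 4, 6, 8, 10],
    'b': [5, 4, 3, 1, 2],
    'c': [1, 3, 2, 5, 4],
    'note': ['x', 'y', 'z', 'x', 'y'],  # text column, ignored by numeric_only=True
})
ranking = rank_by_correlation(toy, 'target')
print(ranking)
```

Here `b` ranks above `c` although its correlation is negative, which a plain descending sort would hide.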

## The next step is to deal with the missing data in the feature "Wake up", which could affect the quality of sleep.
Heat, an urgent need to urinate or thirst can wake us up during the night. Frequent waking will inevitably lead to a decline in the quality of sleep. The data source only gives three cases: ":)", ":|" and ":(". I can't be 100% sure about the meaning of these symbols, so I simply convert them to numbers: 2 for ":)", 1 for ":|" and 0 for ":(". 

So I looked for a reference about this application provided by ***Northcube***. I found an article on this topic at this [link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6806072/). The authors compared different methods, including commercially available sleep trackers, namely an activity tracker: Mi Band (Xiaomi, Beijing, China); a scientific actigraph: Motionwatch 8 (CamNTech, Cambridge, UK); and a much-used mobile phone application: Sleep Cycle (***Northcube***, Gothenburg, Sweden). 

In the Discussion section of the [Kaggle site](https://www.kaggle.com/danagerous/sleep-data/discussion/80305), Jovita Tamulyte provided some useful information about these symbols. In her opinion, they could describe your subjective mood in the morning: <br>
:) - you feel great after the night (good) <br>
:| - not so good (average)<br>
:( - terrible night (bad)<br>

In [None]:
# So we can replace these symbols with ordinal numbers (2 = good, 1 = average, 0 = bad)
df['Wake up'] = df['Wake up'].replace({':)':2, ':|':1, ':(':0})

In [None]:
df2 = df[["Sleep quality", "Wake up", "Time in bed", "Start time in second", "End time in second","Activity (steps)"]]

In [None]:
# Drop the NaN elements
df2 = df2.dropna()

In [None]:
# convert the type from object to integer
df2['Wake up'] = df2['Wake up'].astype('int')

In [None]:
# Let's check the correlations of features to the "sleep quality"
df2.corr()['Sleep quality'].sort_values(ascending = False)

# The next step is to analyze the feature "Heart rate", which has the most missing data. 
## Unfortunately, I have no idea what this "Heart rate" feature means. Is it the average heart rate during sleep, or the heart rate at the moment of waking up?

In [None]:
df3 = df[["Sleep quality", "Wake up","Heart rate","Time in bed", "Start time in second", "End time in second"]]

In [None]:
# Drop the NaN elements
df3 = df3.dropna()
df3['Wake up'] = df3['Wake up'].astype('int')

In [None]:
# Let's check the correlations of features to the "sleep quality"
df3.corr()['Sleep quality'].sort_values(ascending = False)

In [None]:
# visualisation of this correlation
fig = plt.figure(figsize = (12,10))
r = sns.heatmap(df3.corr(),cmap='Oranges')
# set title
r.set_title('Correlation')

## The "Sleep quality" is most affected by "Time in bed". 

# 2 - Exploratory Data Analysis

In [None]:
# Pairplot
sns.pairplot(df2, hue='Wake up')

In [None]:
# Joint plot of features "Sleep quality" and "Time in bed" with unit second.
sns.jointplot(x='Sleep quality',y='Time in bed',data=df,color='blue',kind = 'kde')

In [None]:
# The average of "Time in bed"

print ('The average time in bed of these users is :', df['Time in bed'].mean(), 'second')
print ('The average time in bed of these users is :', df['Time in bed'].mean()/3600, 'hour')

In [None]:
# The Histogram of Start time and End time
plt.figure(figsize=(10,6))
df['Start in hour'] = df['Start time in second'].apply(lambda x: x/3600)
df['End in hour'] = df['End time in second'].apply(lambda x: x/3600)
df['Start in hour'].hist(alpha=0.5,color='blue',label='Start Time',bins=50)
df['End in hour'].hist(alpha=0.5,color='red',label='End Time',bins=50)
plt.legend()
plt.xlim((0, 24)) 
plt.xticks(np.arange(0, 25, 1))
plt.xlabel('Hour in a day')
plt.ylabel('Count')

In [None]:
# The Histogram of Steps
plt.figure(figsize=(10,6))
df['Activity (steps)'].hist(alpha=0.5,color='green',label='Steps',bins=50)
plt.legend()

plt.xlabel('Steps')
plt.ylabel('Count')

## From the histogram above, the "steps" are not steps taken during sleep but steps taken during the day, which represent the daytime activity. 

In [None]:
# Joint plot of features "Sleep quality" and "Activity (steps)".
sns.jointplot(x='Sleep quality',y='Activity (steps)',data=df,color='red',kind = 'kde')

In [None]:
# Drop the meaningless step counts (0)

df_new = df[df['Activity (steps)'] != 0]
df_new

In [None]:
# Let's check the correlations of features to the "sleep quality"
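# numeric_only=True (pandas >= 1.5) skips the text columns; pandas 2.x raises an error without it
df_new.corr(numeric_only=True)['Sleep quality'].sort_values(ascending = False)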

In [None]:
# Scatter plot
plt.figure(figsize=(10,6))
plt.scatter(df_new['Sleep quality'],df_new['Activity (steps)'], c="g", alpha=0.5, marker=r'$\clubsuit$',
            label="Sleep quality vs. Steps")
plt.xlabel("Sleep quality")
plt.ylabel("Steps during the day")
plt.legend(loc='upper left')
plt.show()

If we only look at the correlation between "Sleep quality" and "Activity (steps)", they are not strongly related. It is generally believed that a large amount of activity during the day improves sleep quality. However, we do not see that relation here.

# 3 - Machine Learning

## Train and test split

In [None]:
# We use features of "Time in bed","Start time in second", "End time in second" and "Activity (steps)" to predict the feature "Sleep quality"
# We choose to use df
X = df[['Time in bed', 'Start time in second','End time in second','Activity (steps)']].values
y = df['Sleep quality'].values

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

In [None]:
# In order to normalize the features, it is better to use MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
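The scaling step above can be written out by hand to show why the scaler is fitted on the training data only. This is a sketch of the min-max formula `(x - min) / (max - min)` with NumPy, not scikit-learn's own implementation; the function names and toy arrays are hypothetical, and constant columns (zero range) are not handled.

```python
import numpy as np


def fit_min_max(train):
    """Learn the per-column minimum and range from the training data only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min  # assumes no constant column
    return col_min, col_range


def apply_min_max(data, col_min, col_range):
    """Scale data with statistics learned from the training set."""
    return (data - col_min) / col_range


train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[15.0, 15.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
print(train_scaled)
print(test_scaled)
```

The training columns end up in [0, 1], while a test value outside the training range (15 in the first column) is scaled to 1.5. That is expected, because the test set does not take part in fitting.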

## 3.1 - LinearRegression Model

In [None]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
# LinearRegression.score returns the coefficient of determination (R^2), not a classification accuracy
print('test R^2 score:', lm.score(X_test,y_test))
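The `score` method of `LinearRegression` returns the coefficient of determination R², defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that formula, with a hypothetical helper name and made-up values:

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared errors of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared errors of predicting the mean
    return 1.0 - ss_res / ss_tot


y_true = [60, 70, 80, 90]
perfect = r2_score_manual(y_true, y_true)             # perfect predictions
baseline = r2_score_manual(y_true, [75, 75, 75, 75])  # always predicting the mean
print(perfect, baseline)
```

A model that always predicts the mean scores 0 and a perfect model scores 1, so the value says how much of the variance is explained, not how many samples are predicted exactly.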

## 3.2 - KNN (K nearest neighbors) model

In [None]:
from sklearn.neighbors import KNeighborsClassifier

error_rate =[]
for i in range(1,20):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

plt.figure(figsize=(10,6))
plt.plot(range(1,20),error_rate, color ='red',linestyle='dashed',marker='v',
        markerfacecolor = 'blue', markersize=10)
plt.title('Error Rate vs. K value')
plt.xlabel('K')
plt.ylabel('Error Rate')

In [None]:
knn = KNeighborsClassifier(n_neighbors=14) # K=14 is chosen with the elbow method from the plot above
knn.fit(X_train,y_train)
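Reading K off the plot by eye can be complemented by taking the K with the lowest recorded error. A minimal sketch, assuming a list like `error_rate` from the loop above; the helper name `best_k` and the toy error values are made up for illustration.

```python
import numpy as np


def best_k(error_rate, first_k=1):
    """Return the K with the lowest error (the smallest such K on ties)."""
    return int(np.argmin(error_rate)) + first_k


# Invented error rates for K = 1..5
toy_error_rate = [0.90, 0.80, 0.85, 0.70, 0.70]
print(best_k(toy_error_rate))
```

Because the error rates in this notebook are measured on the test set, a K chosen this way makes the reported test accuracy somewhat optimistic. A separate validation split would give a cleaner estimate.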

In [None]:
print('test accuracy:', knn.score(X_test,y_test))

## 3.3 - Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
print('test accuracy:', logmodel.score(X_test,y_test))

## 3.4 - Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
print('test accuracy:', dtree.score(X_test,y_test))

## 3.5 - Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=20)
rfc.fit(X_train,y_train)
print('test accuracy:', rfc.score(X_test,y_test))

## 3.6 - Support Vector Machine (SVM) Algorithm

Support vector machines (SVMs) are a set of supervised machine learning methods used for classification, regression and outlier detection.

The advantages of support vector machines are :

• Effective in high dimensional spaces.
• Still effective in cases where number of dimensions is greater than the number of samples.
• Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
• Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines are : 

• If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial. 
• SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation. 

[source from site](https://scikit-learn.org/stable/modules/svm.html)



In [None]:
# First SVM model
from sklearn.svm import SVC
svm=SVC(random_state=101)
svm.fit(X_train, y_train)
print('train accuracy:', svm.score(X_train,y_train))
print('test accuracy:', svm.score(X_test,y_test))

# Reduce the unnecessary features to improve estimators' accuracy scores then apply gridsearch method

SelectKBest: removes all but the highest scoring features

For classification generally these methods are used: chi2, f_classif, mutual_info_classif

$\textbf{chi2}$: Computes chi-squared stats between each non-negative feature and class. This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

$\textbf{f_classif}$: Compute the ANOVA F-value for the provided sample.

$\textbf{mutual_info_classif}$: Estimates mutual information for a discrete target variable. Mutual information (MI) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.

[source from site](https://scikit-learn.org/stable/modules/feature_selection.html)

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
accuracy_list_train=[]
k=np.arange(1,5,1)
for each in k:
    x_new = SelectKBest(f_classif, k = each).fit_transform(X_train,y_train)
    svm.fit(x_new, y_train)
    accuracy_list_train.append(svm.score(x_new,y_train))
    
plt.plot(k, accuracy_list_train, color='green', label='train')
plt.xlabel('k values')
plt.ylabel('train accuracy')
plt.legend()
plt.show()

In [None]:
d = {'best features number': k, 'train_score': accuracy_list_train}
df3 = pd.DataFrame(data=d)
print ('max accuracy:', df3['train_score'].max())
print ('max accuracy id:', df3['train_score'].idxmax())
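The loop above transforms only the training data, so the selected columns are never applied to the test set. A minimal sketch of keeping the fitted selector and reusing it; the synthetic arrays are invented for illustration (only the first feature carries signal) and are not the sleep data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 4 features, only the first one depends on the class label
rng = np.random.default_rng(0)
y_toy_train = np.array([0] * 20 + [1] * 20)
X_toy_train = rng.normal(size=(40, 4))
X_toy_train[:, 0] += 10 * y_toy_train
X_toy_test = rng.normal(size=(10, 4))

# Fit the selector on the training data, then apply the same columns to the test data
selector = SelectKBest(f_classif, k=1)
X_toy_train_sel = selector.fit_transform(X_toy_train, y_toy_train)
X_toy_test_sel = selector.transform(X_toy_test)

kept = selector.get_support(indices=True)  # indices of the selected columns
print(kept, X_toy_train_sel.shape, X_toy_test_sel.shape)
```

With the selector kept around, a model fitted on the reduced training set can be scored on the equally reduced test set.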

In [None]:
# To sum up,
print ('Using the normalisation preprocessing: \n'
    'Linear Regression Model R^2 score:',lm.score(X_test,y_test),'\n',
    'KNN Model accuracy:', knn.score(X_test,y_test),'\n',
      'Logistic Regression Model accuracy:',logmodel.score(X_test,y_test),'\n',
      'Decision Tree Model accuracy:', dtree.score(X_test,y_test),'\n',
      'Random Forest Model accuracy:', rfc.score(X_test,y_test),'\n',
      'Support Vector Machine accuracy:', svm.score(X_test,y_test))

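The highest score in this case comes from the Linear Regression model, at about 0.41. Note that this value is an R² score (coefficient of determination), while the other models report classification accuracy, so the numbers are not directly comparable. 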
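`LinearRegression.score` returns R², whereas the `score` method of the classifiers returns accuracy, so one way to put all models on the same footing is to compute a shared error metric from their predictions. A minimal sketch using the mean absolute error; the helper name and the prediction values are invented for illustration and are not results from this notebook.

```python
import numpy as np


def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between true and predicted sleep quality."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Invented values: true sleep quality and two sets of predictions
y_true = [80, 75, 90, 60]
regressor_pred = [78.5, 77.0, 85.0, 66.5]  # continuous outputs
classifier_pred = [80, 70, 90, 75]         # predicted class labels

mae_regressor = mean_absolute_error_manual(y_true, regressor_pred)
mae_classifier = mean_absolute_error_manual(y_true, classifier_pred)
print(mae_regressor, mae_classifier)
```

In the notebook, the predictions would come from calls such as `lm.predict(X_test)` and `knn.predict(X_test)`.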

Kaggle supports the Machine Learning Engine (AutoML) from Google Cloud Platform, so we can check whether Google ***AutoML*** improves the accuracy. There is a tutorial on how to use Google AutoML in Kaggle at this [link](https://www.kaggle.com/devvret/automl-tables-tutorial-notebook).

# Step 2: Initialize the clients and move your data to GCS

In [None]:
#REPLACE THIS WITH YOUR OWN GOOGLE PROJECT ID
PROJECT_ID = 'optimal-chimera-279914'
#REPLACE THIS WITH A NEW BUCKET NAME. NOTE: BUCKET NAMES MUST BE GLOBALLY UNIQUE
BUCKET_NAME = 'optimal-chimera-279914'
#Note: the bucket_region must be us-central1.
BUCKET_REGION = 'us-central1'

From there, we'll use our account with the AutoML and GCS libraries to initialize the clients we can use to do the rest of our work. The code below is boilerplate you can use directly, assuming you've entered your own PROJECT_ID and BUCKET_NAME in the previous step.

In [None]:
from google.cloud import storage, automl_v1beta1 as automl

storage_client = storage.Client(project=PROJECT_ID)
tables_gcs_client = automl.GcsClient(client=storage_client, bucket_name=BUCKET_NAME)
automl_client = automl.AutoMlClient()
# Note: AutoML Tables currently is only eligible for region us-central1. 
prediction_client = automl.PredictionServiceClient()
# Note: This line fails unless all of these parameters are provided
tables_client = automl.TablesClient(project=PROJECT_ID, region=BUCKET_REGION, client=automl_client, gcs_client=tables_gcs_client, prediction_client=prediction_client)

In [None]:
# Create your GCS Bucket with your specified name and region (if it doesn't already exist)
bucket = storage.Bucket(storage_client, name=BUCKET_NAME)
if not bucket.exists():
    bucket.create(location=BUCKET_REGION)

In order to actually move my local files to GCS, I've copied over a few helper functions from another helpful tutorial Notebook on moving Kaggle data to GCS.

In [None]:
def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket. https://cloud.google.com/storage/docs/ """
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
    print('File {} uploaded to {}.'.format(
        source_file_name,
        destination_blob_name))
    
def download_to_kaggle(bucket_name,destination_directory,file_name,prefix=None):
    """Takes the data from your GCS Bucket and puts it into the working directory of your Kaggle notebook"""
    os.makedirs(destination_directory, exist_ok = True)
    full_file_path = os.path.join(destination_directory, file_name)
    blobs = storage_client.list_blobs(bucket_name,prefix=prefix)
    for blob in blobs:
        blob.download_to_filename(full_file_path)

In [None]:
# get the clean data without NaN
df_clean = df[['Time in bed', 'Start time in second','End time in second','Activity (steps)','Sleep quality']]

In [None]:
#rename the columns by removing the space ' '
df_clean.columns = ['Timeinbed','Starttimeinsecond','Endtimeinsecond','Activity','Sleepquality']
df_clean

In [None]:
# Randomly split the data into train and test set including the features and prediction
train_set = df_clean.sample(frac=0.75, random_state=0)
test_set = df_clean.drop(train_set.index)

In [None]:
# add a new column named 'ID'
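# Derive the ID ranges from the actual set sizes instead of hard-coding them
n_train, n_test = len(train_set), len(test_set)
train_set['ID'] = np.arange(1, n_train + 1)
test_set['ID'] = np.arange(n_train + 1, n_train + n_test + 1)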

In [None]:
train_set = train_set.set_index(np.arange(1, len(train_set) + 1))
train_set

In [None]:
test_set = test_set.set_index(np.arange(len(train_set) + 1, len(train_set) + len(test_set) + 1))
test_set

In [None]:
# Any results you write to the current directory are saved as output.
# Write the dataframes back out to a csv file, which we can more easily upload to GCS. 
train_set.to_csv(path_or_buf='train.csv', index=False)
test_set.to_csv(path_or_buf='test.csv', index=False)

Now I just run those functions on my train.csv and test.csv files saved locally and all my data is in the right place within Google Cloud Storage.

In [None]:
upload_blob(BUCKET_NAME, 'train.csv', 'train.csv')
upload_blob(BUCKET_NAME, 'test.csv', 'test.csv')

# Step 3: Train an AutoML Model

I'll break down the training step for AutoML into three operations:

+ Importing the data from your GCS bucket to your autoML client
+ Specifying the target you want to predict on your dataset
+ Creating your model


## Importing from GCS to AutoML

The first step is to create a dataset within AutoML Tables that references your saved data in GCS. This is relatively straightforward: first, simply choose a name for your dataset.

In [None]:
dataset_display_name = 'sleep_quality'
new_dataset = False
try:
    dataset = tables_client.get_dataset(dataset_display_name=dataset_display_name)
except Exception:  # assume the dataset does not exist yet and create it
    new_dataset = True
    dataset = tables_client.create_dataset(dataset_display_name)

And next, give it the path to where the relevant data is in GCS (GCS file paths follow the format gs://BUCKET_NAME/file_path) and import your data.

In [None]:
# gcs_input_uris follow the familiar format gs://BUCKET_NAME/file_path

if new_dataset:
    gcs_input_uris = ['gs://' + BUCKET_NAME + '/train.csv']

    import_data_operation = tables_client.import_data(
        dataset=dataset,
        gcs_input_uris=gcs_input_uris
    )
    print('Dataset import operation: {}'.format(import_data_operation))

    # Synchronous check of operation status. Wait until import is done.
    import_data_operation.result()
print(dataset)


## This dataset is too small: it has only 665 rows (the minimum is 1000 for Google Cloud AutoML)