# Melur Erwina binti Mohamad Iskandar

## Research question/interests

**Is being self-sufficient the main core of self-care in leading a thriving life?**

This research question explores whether physical health is the foundation of a healthy mind. I assume that a person who is able to take care of themselves has strong self-discipline in navigating life and is therefore more productive day to day. Following the idea that a healthy body means a healthy mind, I plan to compare factors that contribute to physical health (SLEEP_HOURS, BMI_RANGE, FRUITS_VEGGIES, DAILY_STEPS) with factors that reflect a healthy mind (TODO_COMPLETED, DAILY_STRESS).

### Analysis Plan:
1. I will keep only the columns for the factors I'm interested in (SLEEP_HOURS, BMI_RANGE, FRUITS_VEGGIES, DAILY_STEPS, TODO_COMPLETED, DAILY_STRESS) plus one demographic column (GENDER) for observation purposes.
1. I will then remove rows that contain NaN values so that the observations rest on complete records, and apply the dataframe `describe` function for an overview of the data.
1. Next, I will look at how FRUITS_VEGGIES and DAILY_STEPS are distributed across BMI_RANGE as a measure of a healthy body, and how SLEEP_HOURS and BMI_RANGE relate to TODO_COMPLETED, to see whether self-sufficiency drives the motivation to accomplish things in life.
1. Finally, I will examine the relationship between TODO_COMPLETED and DAILY_STRESS to see whether an orderly, productive life comes with less stress, and between BMI_RANGE and DAILY_STRESS to see whether a healthy lifestyle goes with a healthy mind.

## Loading Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv("../data/raw/Wellbeing_and_lifestyle_data_Kaggle.csv")
df

## Milestone 3

### Task 1: Conduct an Exploratory Data Analysis (EDA) on your dataset

In [None]:
# Take an explicit copy so later column assignments do not raise SettingWithCopyWarning
dfa = df[["SLEEP_HOURS", "BMI_RANGE", "FRUITS_VEGGIES", "DAILY_STEPS", "TODO_COMPLETED", "DAILY_STRESS", "GENDER"]].copy()
dfa

In [None]:
dfa.head()

In [None]:
# dropna returns a new dataframe, so assign the result back
dfa = dfa.dropna()

### Observation
- No NaN values recorded for the focused columns in the dataframe.

In [None]:
dfa.describe(include='number').T

In [None]:
dfa.describe(exclude='number').T

### Observation
- This table shows 7 unique values in the DAILY_STRESS column, although the survey offers only 6 answer levels.
- This means that the column contains an unwanted value.
- Therefore, this value must be identified and the row containing it dropped.

In [None]:
dfa["DAILY_STRESS"].unique()

In [None]:
dfa.loc[dfa["DAILY_STRESS"]=="1/1/00"]

In [None]:
dfa = dfa.drop([10005]).reset_index(drop=True)
dfa

### Observation
- One row is dropped from the dataframe because of an illogical value in the DAILY_STRESS column, leaving 15971 rows.
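Dropping the row by its hard-coded index works for this file, but it depends on the row order staying the same. As a hedged alternative, the sketch below finds invalid entries by content instead. The `toy` dataframe and its values are made up for illustration; only the column names and the `"1/1/00"` value come from the notebook.

```python
import pandas as pd

# Toy stand-in for the real data (values are invented for illustration).
toy = pd.DataFrame({
    "DAILY_STRESS": ["2", "3", "1/1/00", "5", "0"],
    "GENDER": ["Female", "Male", "Male", "Female", "Male"],
})

# Coerce to numeric: anything that is not a number becomes NaN.
stress_numeric = pd.to_numeric(toy["DAILY_STRESS"], errors="coerce")

# Rows whose stress value could not be parsed as a number.
invalid_rows = toy[stress_numeric.isna()]

# Keep only the valid rows and store the column as numbers.
toy_clean = (
    toy.loc[stress_numeric.notna()]
    .assign(DAILY_STRESS=lambda d: pd.to_numeric(d["DAILY_STRESS"]))
    .reset_index(drop=True)
)
```

Inspecting `invalid_rows` before dropping anything keeps the cleaning step auditable, and the same mask would still work if the raw file were re-exported in a different order.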

In [None]:
dfa.describe(exclude='number').T

In [None]:
dfa["DAILY_STRESS"] = pd.to_numeric(dfa["DAILY_STRESS"])

In [None]:
# astype returns a new Series, so assign it back to change the column
dfa["BMI_RANGE"] = dfa["BMI_RANGE"].astype("category")

### Observation
- Columns **BMI_RANGE** and **TODO_COMPLETED** are changed to categorical so that they can be used as hue when plotting graphs and are excluded from numerical calculations.
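A small sketch of why the assignment matters: `astype` returns a new Series and leaves the dataframe untouched unless the result is stored. The `toy` dataframe is an invented example that only reuses the notebook's column names.

```python
import pandas as pd

# Toy data with the same column names as the notebook (values invented).
toy = pd.DataFrame({"BMI_RANGE": [1, 2, 1, 2], "SLEEP_HOURS": [7, 8, 6, 7]})

# Calling astype without assignment does not change the column.
toy["BMI_RANGE"].astype("category")
dtype_without_assignment = toy["BMI_RANGE"].dtype

# Assigning the result back does change it.
toy["BMI_RANGE"] = toy["BMI_RANGE"].astype("category")
dtype_with_assignment = toy["BMI_RANGE"].dtype

# A categorical column is no longer included in the numeric summary.
numeric_columns = toy.describe(include="number").columns.tolist()
```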

### Data Visualisation

In [None]:
sns.lineplot(x="DAILY_STEPS", y="FRUITS_VEGGIES",
             hue="BMI_RANGE", palette="muted",
             data=dfa)

### Observation
- The line graph shows that, at each level of daily steps, a higher intake of fruits and vegetables is associated with the lower BMI range (1: BMI below 25).
- This suggests that fruit and vegetable intake plays a bigger role than daily steps in BMI.
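The pattern read off the line graph can also be checked numerically. The sketch below tabulates the mean fruit and vegetable intake for each combination of daily steps and BMI range; the `toy` values are invented, so the numbers illustrate the technique only and are not results from the dataset.

```python
import pandas as pd

# Invented example values; only the column names come from the notebook.
toy = pd.DataFrame({
    "DAILY_STEPS":    [1, 1, 1, 1, 5, 5, 5, 5],
    "BMI_RANGE":      [1, 1, 2, 2, 1, 1, 2, 2],
    "FRUITS_VEGGIES": [3, 5, 2, 2, 4, 4, 3, 1],
})

# Mean intake per step level (rows) and BMI range (columns).
mean_fv = toy.pivot_table(
    index="DAILY_STEPS", columns="BMI_RANGE",
    values="FRUITS_VEGGIES", aggfunc="mean",
)

# Positive gap: BMI range 1 eats more fruit and vegetables at that step level.
gap = mean_fv[1] - mean_fv[2]
```

If `gap` stays positive at every step level in the real data, that would support the reading of the plot; a gap that changes sign would call for a more cautious statement.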

In [None]:
g = sns.catplot(
    data=dfa, x="DAILY_STEPS", y="FRUITS_VEGGIES",
    hue="BMI_RANGE", palette="muted", col="GENDER",
    capsize=.2, errorbar="se",
    kind="point", height=6, aspect=.75,
)
g.despine(left=True)

In [None]:
# sns.lineplot has no col/kind/height arguments; relplot(kind="line") facets by GENDER
sns.relplot(
    data=dfa, x="DAILY_STEPS", y="FRUITS_VEGGIES",
    hue="BMI_RANGE", palette="muted", col="GENDER",
    errorbar="se", kind="line", height=6, aspect=.75,
)

### Observation
- Demographically, males eat fewer fruits and vegetables than females.


In [None]:
sns.catplot(
    data=dfa, kind="bar",
    x="SLEEP_HOURS", y="TODO_COMPLETED", hue="BMI_RANGE", palette="dark")

### Observation
- Overall, BMI_RANGE level 1 (lower BMI: not overweight) has more completed tasks than BMI_RANGE level 2.
- Completed tasks increase as sleep hours increase from 1 to 8, then decrease as sleep hours go beyond 8.
- The graph shows that 8 hours of sleep with BMI_RANGE level 1 gives the highest daily productivity, measured by how well one completes their tasks, followed by BMI_RANGE level 2, although the difference is small.
- This suggests that the ability to take care of oneself (optimum sleep hours and BMI range) is linked to the level of daily productivity, as driven by self-discipline.

In [None]:
g = sns.catplot(dfa, x="SLEEP_HOURS", y="TODO_COMPLETED", hue="BMI_RANGE",
    palette="muted", col="GENDER", kind="bar"
)
g.despine(left=True)

### Observation
- Completed tasks peak at 8 hours of sleep for females and at 7 hours of sleep for males.

In [None]:
s = sns.boxplot(data=dfa, x="BMI_RANGE", y="DAILY_STRESS", hue="GENDER")
sns.move_legend(s, "upper left", bbox_to_anchor=(1, 1))

### Observation
- The boxplot shows that, in the lower BMI range, males report lower levels of daily stress than females.

In [None]:
sns.displot(
    dfa, x="DAILY_STRESS", col="GENDER", row="BMI_RANGE",
    binwidth=3, height=3, facet_kws=dict(margin_titles=True),
)

In [None]:
sns.countplot(data=dfa, x="DAILY_STRESS", hue="GENDER")

In [None]:
b = sns.boxplot(x="TODO_COMPLETED", y="DAILY_STRESS", hue="GENDER",
                palette=["m", "g"],
                data=dfa)
sns.despine(offset=10, trim=True)
sns.move_legend(b, "upper left", bbox_to_anchor=(1, 1))

### Observation
- The boxplot shows that completing more tasks is associated with lower levels of daily stress in males, while females retain the same stress level regardless of the number of completed tasks.
- Only at the highest number of completed tasks do females show a wider range that extends to lower levels of daily stress.

In [None]:
sns.displot(
    dfa, x="TODO_COMPLETED", col="DAILY_STRESS", row="GENDER", kind="hist",
    binwidth=3, height=3, facet_kws=dict(margin_titles=True),
)

#### Next Analysis
- A more detailed breakdown of the data visualisations.

## Milestone 4

### Task 1: Set up an “Analysis Pipeline”

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import sys
sys.path.append("code")
from project_functions3 import load_and_process

dfc = load_and_process("../data/raw/Wellbeing_and_lifestyle_data_Kaggle.csv")

NameError: name 'null' is not defined

In [None]:
# Check column data types
dfa.dtypes

### Observation
- Columns **BMI_RANGE** and **TODO_COMPLETED** are of type int64.
- **BMI_RANGE** values are 1 and 2, each standing for a bracket of BMI values; they do not represent the BMI value itself.
- Therefore, the data type of **BMI_RANGE** needs to be changed to object so that it is not treated as numerical.
- This allows **BMI_RANGE** to be used as hue when plotting graphs.
- Column **TODO_COMPLETED** is changed to object type so that it can be used as hue when plotting graphs.

In [None]:
dfa['BMI_RANGE'] = dfa.BMI_RANGE.astype(object)
dfa['TODO_COMPLETED'] = dfa.TODO_COMPLETED.astype(object)

In [None]:
# Check column data types again
dfa.dtypes

In [None]:
dfa.head()

### Observation
- No NaN values recorded for the focused columns in the dataframe.

In [None]:
dfa.describe(include='number').T

In [None]:
dfa.describe(exclude='number').T

### Observation
- The DAILY_STRESS column shows 7 unique values, although the survey offers only 6 answer levels.
- This means that the column contains an unwanted value.
- Therefore, the row containing this value must be identified and dropped.

In [None]:
dfa["DAILY_STRESS"].unique()

In [None]:
dfa.loc[dfa["DAILY_STRESS"]=="1/1/00"]

In [None]:
dfa = dfa.drop([10005]).reset_index(drop=True)
dfa

### Observation
- One row is dropped from the dataframe because of an illogical value in the DAILY_STRESS column, leaving 15971 rows.

### Task 2: Method Chaining and writing Python programs

In [None]:
# Step 1: Build and test your method chain(s)

import pandas as pd
import numpy as np

df = pd.read_csv("../data/raw/Wellbeing_and_lifestyle_data_Kaggle.csv")

# Method chaining begins

dfamc = (
    df
    # Drop the columns that are not part of the research question
    .drop(columns=["PLACES_VISITED", "CORE_CIRCLE", "SUPPORTING_OTHERS",
                    "SOCIAL_NETWORK", "ACHIEVEMENT", "DONATION", "LOST_VACATION",
                    "DAILY_SHOUTING", "SUFFICIENT_INCOME", "PERSONAL_AWARDS", "TIME_FOR_PASSION",
                    "WEEKLY_MEDITATION", "AGE", "WORK_LIFE_BALANCE_SCORE", "FLOW", "LIVE_VISION",
                    "Timestamp"])
    .dropna()
    # Treat BMI_RANGE and TODO_COMPLETED as labels rather than numbers
    .assign(BMI_RANGE = lambda x: x['BMI_RANGE'].astype(object),
            TODO_COMPLETED = lambda x: x['TODO_COMPLETED'].astype(object))
    .drop([10005]) # the DAILY_STRESS value at index 10005 is invalid ("1/1/00"), so the row is dropped
    .reset_index(drop=True)
)


dfamc
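The method chain can be wrapped in a function so that the notebook only has to import and call it. The sketch below is an assumption about how such a `load_and_process` function could look; it is not the contents of `project_functions3`, which are not shown here. It also swaps the hard-coded index drop for a content-based filter, and the `KEEP_COLUMNS` constant and the sample CSV are hypothetical.

```python
import io

import pandas as pd

# Hypothetical constant: the columns used in the analysis.
KEEP_COLUMNS = [
    "SLEEP_HOURS", "BMI_RANGE", "FRUITS_VEGGIES", "DAILY_STEPS",
    "TODO_COMPLETED", "DAILY_STRESS", "GENDER",
]


def load_and_process(path_or_buffer):
    """Load the raw CSV and return a cleaned dataframe (illustrative sketch)."""
    return (
        pd.read_csv(path_or_buffer, usecols=KEEP_COLUMNS)
        .dropna()
        # Non-numeric stress entries such as "1/1/00" become NaN here...
        .assign(DAILY_STRESS=lambda d: pd.to_numeric(d["DAILY_STRESS"], errors="coerce"))
        # ...and are removed here.
        .dropna(subset=["DAILY_STRESS"])
        .assign(
            DAILY_STRESS=lambda d: d["DAILY_STRESS"].astype(int),
            BMI_RANGE=lambda d: d["BMI_RANGE"].astype(object),
            TODO_COMPLETED=lambda d: d["TODO_COMPLETED"].astype(object),
        )
        .reset_index(drop=True)
    )


# Tiny in-memory CSV (invented rows) to try the function without the real file.
sample_csv = io.StringIO(
    "Timestamp,SLEEP_HOURS,BMI_RANGE,FRUITS_VEGGIES,DAILY_STEPS,TODO_COMPLETED,DAILY_STRESS,GENDER\n"
    "7/7/15,7,1,3,5,6,2,Female\n"
    "7/7/15,8,2,2,5,5,1/1/00,Male\n"
    "7/7/15,6,2,4,4,2,3,Male\n"
)
cleaned = load_and_process(sample_csv)
```

Selecting the wanted columns with `usecols` avoids listing every unwanted column, at the cost of having to update `KEEP_COLUMNS` if the research question grows.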

### Task 3: Conduct your analysis to help answer your research question(s)

## Research question/interests

**Is being self-sufficient the main core of self-care in leading a thriving life?**

This research question explores whether physical health is the foundation of a healthy mind. I assume that a person who is able to take care of themselves has strong self-discipline in navigating life and is therefore more productive day to day. Following the idea that a healthy body means a healthy mind, I plan to compare factors that contribute to physical health (SLEEP_HOURS, BMI_RANGE, FRUITS_VEGGIES, DAILY_STEPS) with factors that reflect a healthy mind (TODO_COMPLETED, DAILY_STRESS).

### Analysis Plan:
1. I will keep only the columns for the factors I'm interested in (SLEEP_HOURS, BMI_RANGE, FRUITS_VEGGIES, DAILY_STEPS, TODO_COMPLETED, DAILY_STRESS) plus one demographic column (GENDER) for observation purposes.
1. I will then remove rows that contain NaN values so that the observations rest on complete records, and apply the dataframe `describe` function for an overview of the data.
1. Next, I will look at how FRUITS_VEGGIES and DAILY_STEPS are distributed across BMI_RANGE as a measure of a healthy body, and how SLEEP_HOURS and BMI_RANGE relate to TODO_COMPLETED, to see whether self-sufficiency drives the motivation to accomplish things in life.
1. Finally, I will examine the relationship between TODO_COMPLETED and DAILY_STRESS to see whether an orderly, productive life comes with less stress, and between BMI_RANGE and DAILY_STRESS to see whether a healthy lifestyle goes with a healthy mind.

### Data Visualisation

In [None]:
sns.lineplot(x="DAILY_STEPS", y="FRUITS_VEGGIES",
             hue="BMI_RANGE", palette="muted",
             data=dfa)

### Observation
- The line graph shows that, at each level of daily steps, a higher intake of fruits and vegetables is associated with the lower BMI range (1: BMI below 25).
- This suggests that fruit and vegetable intake plays a bigger role than daily steps in BMI.

In [None]:
g = sns.catplot(
    data=dfa, x="DAILY_STEPS", y="FRUITS_VEGGIES",
    hue="BMI_RANGE", palette="muted", col="GENDER",
    capsize=.2, errorbar="se",
    kind="point", height=6, aspect=.75,
)
g.despine(left=True)

In [None]:
# sns.lineplot has no col/kind/height arguments; relplot(kind="line") facets by GENDER
sns.relplot(
    data=dfa, x="DAILY_STEPS", y="FRUITS_VEGGIES",
    hue="BMI_RANGE", palette="muted", col="GENDER",
    errorbar="se", kind="line", height=6, aspect=.75,
)

### Observation
- Demographically, males eat fewer fruits and vegetables than females.


In [None]:
sns.catplot(
    data=dfa, kind="bar",
    x="SLEEP_HOURS", y="TODO_COMPLETED", hue="BMI_RANGE", palette="dark")

### Observation
- Overall, BMI_RANGE level 1 (lower BMI: not overweight) has more completed tasks than BMI_RANGE level 2.
- Completed tasks increase as sleep hours increase from 1 to 8, then decrease as sleep hours go beyond 8.
- The graph shows that 8 hours of sleep with BMI_RANGE level 1 gives the highest daily productivity, measured by how well one completes their tasks, followed by BMI_RANGE level 2, although the difference is small.
- This suggests that the ability to take care of oneself (optimum sleep hours and BMI range) is linked to the level of daily productivity, as driven by self-discipline.
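The bar heights in a seaborn bar plot are group means, so the sleep duration with the most completed tasks can be read from a grouped table instead of by eye. The sketch below assumes TODO_COMPLETED is still numeric (it cannot be averaged after conversion to object) and uses invented values.

```python
import pandas as pd

# Invented example values; only the column names come from the notebook.
toy = pd.DataFrame({
    "SLEEP_HOURS":    [6, 6, 7, 7, 8, 8, 9, 9],
    "BMI_RANGE":      [1, 2, 1, 2, 1, 2, 1, 2],
    "TODO_COMPLETED": [5, 4, 6, 5, 8, 6, 7, 5],
})

# Mean completed tasks per sleep duration (rows) and BMI range (columns).
mean_todo = (
    toy.groupby(["SLEEP_HOURS", "BMI_RANGE"])["TODO_COMPLETED"]
    .mean()
    .unstack("BMI_RANGE")
)

# Sleep duration with the highest mean, for each BMI range.
peak_sleep = mean_todo.idxmax()
```

Adding `"GENDER"` to the grouping keys would give the per-gender peaks in the same way.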

In [None]:
g = sns.catplot(dfa, x="SLEEP_HOURS", y="TODO_COMPLETED", hue="BMI_RANGE",
    palette="muted", col="GENDER", kind="bar"
)
g.despine(left=True)

### Observation
- Completed tasks peak at 8 hours of sleep for females and at 7 hours of sleep for males.

In [None]:
s = sns.boxplot(data=dfa, x="BMI_RANGE", y="DAILY_STRESS", hue="GENDER")
sns.move_legend(s, "upper left", bbox_to_anchor=(1, 1))

### Observation
- The boxplot shows that, in the lower BMI range, males report lower levels of daily stress than females.

In [None]:
sns.displot(
    dfa, x="DAILY_STRESS", col="GENDER", row="BMI_RANGE",
    binwidth=3, height=3, facet_kws=dict(margin_titles=True),
)

In [None]:
sns.countplot(data=dfa, x="DAILY_STRESS", hue="GENDER")

In [None]:
b = sns.boxplot(x="TODO_COMPLETED", y="DAILY_STRESS", hue="GENDER",
                palette=["m", "g"],
                data=dfa)
sns.despine(offset=10, trim=True)
sns.move_legend(b, "upper left", bbox_to_anchor=(1, 1))

### Observation
- The boxplot shows that completing more tasks is associated with lower levels of daily stress in males, while females retain the same stress level regardless of the number of completed tasks.
- Only at the highest number of completed tasks do females show a wider range that extends to lower levels of daily stress.
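The analysis plan mentions the correlation between completed tasks and daily stress, which the boxplot only shows visually. Since both variables are ordinal survey scores, a rank-based (Spearman) correlation per gender is one reasonable way to put a number on it. This is a sketch on invented values, computed as the Pearson correlation of the ranks, and it assumes both columns are numeric.

```python
import pandas as pd

# Invented example values; only the column names come from the notebook.
toy = pd.DataFrame({
    "GENDER":         ["Male"] * 5 + ["Female"] * 5,
    "TODO_COMPLETED": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "DAILY_STRESS":   [5, 4, 3, 2, 1, 3, 2, 4, 2, 3],
})

# Spearman correlation = Pearson correlation of the ranked values.
rho = {
    gender: group["TODO_COMPLETED"].rank().corr(group["DAILY_STRESS"].rank())
    for gender, group in toy.groupby("GENDER")
}
```

A value near -1 means stress falls steadily as completed tasks rise, while a value near 0 means no monotonic relationship; neither says anything about which one causes the other.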

In [None]:
sns.displot(
    dfa, x="TODO_COMPLETED", col="DAILY_STRESS", row="GENDER", kind="hist",
    binwidth=3, height=3, facet_kws=dict(margin_titles=True),
)