# ___Project - Milestone 1___

###  <span style="color: gray;">Jade Chen, Sam Thorne, Dia Zavery</span> 

#### Background information for the data set:
The athlete/non-athlete survey data can be found on [figshare.com](https://figshare.com/articles/dataset/Athlete_Non-Athlete_MH_Survey_-_ALL_DATA_csv/13035050).

The data were collected from a mental health survey of 753 individuals. They contain demographic information, general health and lifestyle information, athlete information, mental health measures, and answers to mental-health-related questions. The study was completed in early 2020 and asked how individuals' mental health was coping in the early stages of the COVID-19 pandemic.

# PART I: Initial Exploration

### Setup

In [19]:
import pandas as pd
import numpy as np
import altair as alt

# Suppress all warnings (including FutureWarning)
import warnings
warnings.filterwarnings("ignore")

### Read in Data

In [20]:
# Check the file's character encoding (only the first 10,000 bytes are sampled)
import chardet

with open('Athlete_Non-Athlete.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
result

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

In [21]:
# Reading the file as 'ascii' raised an error, so we use the more permissive 'ISO-8859-1' encoding instead
data = pd.read_csv('Athlete_Non-Athlete.csv', encoding='ISO-8859-1')
data.head(5)

Unnamed: 0,Respondent ID,Gender:,Age Group:,Country During Lockdown,Mental Health Condition?,Occupation:,Marital Status:,Smoking Status:,Five Fruit and Veg,Hours sleep:,...,LONE_ TOTAL,LONE_ Emotional,LONE_ Social,I experience a general sense of emptiness,I miss having people around,There are many people I can trust completely*,I often feel rejected,There are enough people I feel close to*,There are plenty of people I can rely on when I have problems*,Unnamed: 84
0,11785667914,2,2,2,999,Unemployed,1,1,2,6.5,...,3.67,4.33,3.0,4,5,4,4,2,3,
1,11785634332,2,3,1,"3, 5, 6",Administrator,2,1,1,7.0,...,4.33,4.0,4.67,5,2,4,5,5,5,
2,11784520014,2,3,2,999,Finance,1,3,2,4.0,...,3.5,4.33,2.67,4,4,4,5,2,2,
3,11783867710,2,1,2,2,Unemployed,1,1,2,8.0,...,2.67,3.0,2.33,4,3,2,2,1,4,
4,11783726076,2,1,2,999,Student,1,1,1,8.5,...,4.33,5.0,3.67,5,5,5,5,4,2,
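One possible explanation for the mismatch above is that `chardet` only inspected the first 10,000 bytes, which can all be ASCII even when a later byte is not. The following is a minimal sketch, using only the standard library, of how one might locate non-ASCII bytes in a file. The helper name `find_non_ascii` and the demo file contents are hypothetical and are not part of the original notebook.

```python
import os
import tempfile

def find_non_ascii(path, limit=5):
    """Return up to `limit` (offset, byte value) pairs for bytes outside ASCII."""
    hits = []
    with open(path, 'rb') as f:
        raw = f.read()
    for offset, byte in enumerate(raw):
        if byte > 127:  # ASCII covers byte values 0 to 127 only
            hits.append((offset, byte))
            if len(hits) == limit:
                break
    return hits

# Hypothetical demo file: ASCII text followed by one Latin-1 byte (0xE9).
with tempfile.NamedTemporaryFile(delete=False, suffix='.csv') as tmp:
    tmp.write(b'Occupation\nCaf\xe9 owner\n')
    demo_path = tmp.name

demo_hits = find_non_ascii(demo_path)
print(demo_hits)
os.remove(demo_path)
```

Calling `find_non_ascii('Athlete_Non-Athlete.csv')` would show whether such bytes occur after the sampled prefix. Passing the whole file to `chardet.detect` instead of the first 10,000 bytes is another option.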


### Data Cleaning
Drop the last column (`Unnamed: 84`), which contains no values.

In [22]:
data = data.drop(data.columns[84], axis=1)
data.head(5)

Unnamed: 0,Respondent ID,Gender:,Age Group:,Country During Lockdown,Mental Health Condition?,Occupation:,Marital Status:,Smoking Status:,Five Fruit and Veg,Hours sleep:,...,I tend to take a long time to get over setbacks in my life*,LONE_ TOTAL,LONE_ Emotional,LONE_ Social,I experience a general sense of emptiness,I miss having people around,There are many people I can trust completely*,I often feel rejected,There are enough people I feel close to*,There are plenty of people I can rely on when I have problems*
0,11785667914,2,2,2,999,Unemployed,1,1,2,6.5,...,2,3.67,4.33,3.0,4,5,4,4,2,3
1,11785634332,2,3,1,"3, 5, 6",Administrator,2,1,1,7.0,...,1,4.33,4.0,4.67,5,2,4,5,5,5
2,11784520014,2,3,2,999,Finance,1,3,2,4.0,...,3,3.5,4.33,2.67,4,4,4,5,2,2
3,11783867710,2,1,2,2,Unemployed,1,1,2,8.0,...,4,2.67,3.0,2.33,4,3,2,2,1,4
4,11783726076,2,1,2,999,Student,1,1,1,8.5,...,5,4.33,5.0,3.67,5,5,5,5,4,2
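Dropping by position works here, but it depends on the empty column staying at index 84. As a hedged alternative, the sketch below drops any column whose values are all missing, wherever it sits. The toy frame is made up for illustration and only stands in for the survey data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data; the last column is entirely empty.
toy = pd.DataFrame({
    'Respondent ID': [1, 2, 3],
    'Hours sleep': [6.5, np.nan, 8.0],
    'Unnamed: 84': [np.nan, np.nan, np.nan],
})

# Drop every column in which all values are missing.
# 'Hours sleep' survives because only some of its values are missing.
toy_clean = toy.dropna(axis=1, how='all')
print(list(toy_clean.columns))
```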


### Size of the Dataset
As we can see below, there are 753 rows and 84 columns.

In [23]:
data.shape

(753, 84)

### Data Wrangling
First, we remove the trailing colon (`:`) or question mark (`?`) from each column name.

Second, we cast each column to an appropriate data type, which makes the cardinality easier to compute.

Lastly, we replace cells containing `999` with `NaN`, because we assume this value means 'prefer not to answer'.

In [24]:
data.columns = data.columns.str.replace(r'[?:]$', '', regex=True)
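Note that the pattern `[?:]$` removes only a single trailing character. The sketch below compares it with `[?:]+$`, which strips a whole run of trailing punctuation. The column names are illustrative, and the last one is hypothetical, chosen to show the difference.

```python
import pandas as pd

# Illustrative column names; the last one (hypothetical) ends in two marks.
names = pd.Index(['Gender:', 'Mental Health Condition?', 'Example column?:'])

# Pattern used in the cell above: removes one trailing ':' or '?'.
single = names.str.replace(r'[?:]$', '', regex=True)

# Variant: removes every trailing ':' or '?' in one pass.
greedy = names.str.replace(r'[?:]+$', '', regex=True)

print(list(single))
print(list(greedy))
```

If no column name ends in more than one such character, both patterns give the same result.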

In [25]:
categorical = ['Respondent ID', 'Gender', 'Age Group', 'Country During Lockdown', 'Mental Health Condition', 'Occupation', 'Marital Status', 'Smoking Status', 'Five Fruit and Veg', 'Shielded', '# in lockdown bubble', 'Athlete/Non-Athlete', 'What sport do you play', 'Individual/Team athlete']

for column_name in categorical:
    data[column_name] = data[column_name].astype('category')

temporal = ['Survey Date']

for column_name in temporal:
    data[column_name] = pd.to_datetime(data[column_name])

In [26]:
data.replace(999, np.nan, inplace=True)
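One caveat worth checking: the preview of the data shows `Mental Health Condition` values such as `"3, 5, 6"`, which suggests that column may be read as text. If so, its `999` entries would be the string `'999'`, and replacing the integer `999` would not match them. The sketch below illustrates this on a made-up frame; whether it applies to the real data is an assumption that should be verified.

```python
import numpy as np
import pandas as pd

# Made-up frame: one column read as text, one read as integers.
toy = pd.DataFrame({
    'Mental Health Condition': ['999', '3, 5, 6', '2'],
    'Marital Status': [1, 999, 2],
})

# Replacing only the integer leaves the string '999' untouched.
numeric_only = toy.replace(999, np.nan)

# Replacing both forms catches text and numeric columns alike.
both_forms = toy.replace([999, '999'], np.nan)

print(numeric_only['Mental Health Condition'].isna().sum())
print(both_forms['Mental Health Condition'].isna().sum())
```

Doing this replacement before casting columns to `category` also avoids having to replace values inside categorical columns.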

### Data Attribute Information: Type, Cardinality, and Missing Values
The following DataFrame lists, for each attribute, its name, data type, number of unique values, range (computed for numeric columns only), and number of missing values.

In [27]:
pd.set_option('display.max_rows', None)
# Function to determine if a column is quantitative
def is_quantitative(column):
    return pd.api.types.is_numeric_dtype(column)

quant_range = []

# Loop through the columns of the original DataFrame
for column_name in data.columns:
    column_data = data[column_name]
    data_type = column_data.dtype
    if is_quantitative(column_data):
        data_range = f'{column_data.min()} - {column_data.max()}'
        unique_values = column_data.nunique()
    else:
        data_range = 'N/A'
        unique_values = column_data.nunique()
        
    count_na = data[column_name].isna().sum()
    quant_range.append({'Column': column_name, 'Data Type': data_type, 'Unique Values': unique_values, 'Range': data_range, 'Missing Values': count_na})

# Convert the list of dictionaries to a DataFrame
quant_range_df = pd.DataFrame(quant_range)

# Display the DataFrame with data types and ranges
quant_range_df

Unnamed: 0,Column,Data Type,Unique Values,Range,Missing Values
0,Respondent ID,category,753,,0
1,Gender,category,2,,0
2,Age Group,category,7,,0
3,Country During Lockdown,category,7,,0
4,Mental Health Condition,category,12,,0
5,Occupation,category,405,,3
6,Marital Status,category,5,,0
7,Smoking Status,category,7,,0
8,Five Fruit and Veg,category,2,,0
9,Hours sleep,float64,18,1.5 - 10.5,0
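Because `is_numeric_dtype` returns `False` for datetime columns, the loop above reports `N/A` as the range of a temporal attribute such as `Survey Date`. The following is a minimal sketch of a helper that covers both numeric and datetime columns. The name `describe_range` and the dates in the toy frame are made up for illustration.

```python
import pandas as pd

def describe_range(column):
    """Return 'min - max' for numeric and datetime columns, else 'N/A'."""
    is_numeric = pd.api.types.is_numeric_dtype(column)
    is_temporal = pd.api.types.is_datetime64_any_dtype(column)
    if is_numeric or is_temporal:
        return f'{column.min()} - {column.max()}'
    return 'N/A'

# Toy frame; the survey dates are invented and not taken from the data set.
toy = pd.DataFrame({
    'Hours sleep': [1.5, 7.0, 10.5],
    'Survey Date': pd.to_datetime(['2020-04-01', '2020-04-15', '2020-05-02']),
    'Occupation': ['Student', 'Finance', 'Unemployed'],
})

ranges = {name: describe_range(toy[name]) for name in toy.columns}
print(ranges)
```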


In [28]:
#pd.reset_option('display.max_rows')

### Levels of Categorical Variables
For each categorical variable with fewer than 10 levels, we also show the count of each level.

In [11]:
#Referenced from DSCI 320 Prog4
def cat_cardinality():
    for field in categorical:
        print(field, " ", data[field].nunique())
        if data[field].nunique() < 10:
            print(data[field].value_counts())

cat_cardinality()

Respondent ID   753
Gender   2
Gender
2    400
1    353
Name: count, dtype: int64
Age Group   7
Age Group
2    203
3    182
4    159
1     73
5     73
6     48
7     15
Name: count, dtype: int64
Country During Lockdown   7
Country During Lockdown
1    558
2    186
4      4
3      2
5      1
6      1
7      1
Name: count, dtype: int64
Mental Health Condition   12
Occupation   405
Marital Status   5
Marital Status
2    427
1    277
4     30
5     10
3      9
Name: count, dtype: int64
Smoking Status   7
Smoking Status
1    526
3     84
2     69
4     46
5     21
6      5
7      2
Name: count, dtype: int64
Five Fruit and Veg   2
Five Fruit and Veg
1    410
2    343
Name: count, dtype: int64
Shielded   2
Shielded
2    687
1     66
Name: count, dtype: int64
# in lockdown bubble   7
# in lockdown bubble
4    193
2    183
3    160
5    102
1     61
6     41
7     13
Name: count, dtype: int64
Athlete/Non-Athlete   2
Athlete/Non-Athlete
2    390
1    363
Name: count, dtype: int64
What sport do y

## Data Abstraction:
### <span style="color: lightblue;">Demographic Information</span>

|Attribute|Type|Cardinality|Note|
|---|---|---|---|
|Respondent ID| Nominal| 753| |
| Gender|Nominal|2| |
|Age group| Nominal| 7| |
|Country during lockdown|Nominal|7| |
|Mental health condition|Nominal|12| |
|Occupation|Nominal|405|Has 3 missing values|
|Marital status|Nominal|5| |
|Smoking status|Nominal|7| |

### <span style = "color:lightblue;">Health and lifestyle</span>

|Attribute|Type|Cardinality|Note|
|---|---|---|---|
|Five fruit and veg| Nominal| 2| |
|Hours of sleep| Quantitative| [1.5, 10.5]| |
|Survey date| Temporal| TODO| |
|Shielded| Nominal| 2| |
|Dates shielding| Temporal| TODO| |
|Weeks social distancing| Quantitative| [0,7]| |
|In lockdown bubble| Nominal| 7| |

### <span style = "color:lightblue;"> Athlete Information</span>

|Attribute|Type|Cardinality|Note|
|---|---|---|---|
|Athlete/Non-athlete| Nominal|2| |
|AIMS_TOTAL| | | |
|Social Identity| | 7| |
|What sport do you play?| | | |
|Sport level| | | |
|Total weekly playing hours| | | | 
|Weekly training hours| | | |
|Weekly competing hours| | | |
|Individual/Team athlete?| | | |
|Self-report questions| Ordinal| | Numerous questions share the same type and cardinality, so they are condensed into one row|

### <span style = "color:lightblue;"> Mental Health and Wellbeing</span>

|Attribute|Type|Cardinality|Note|
|---|---|---|---|
|MHC-SF OVERALL| | | |
|Emotional Wellbeing| | | |
|Happy| | | |
|Interested in life| | | |
|Satisfied| | | |
|Social wellbeing| | | |
|HADS OVERALL | | | |
|HADS-A AVERAGE| | | |
|HADS-D AVERAGE| | | |
|RES_TOTAL| | | |
|LONE_TOTAL| | | |
|LONE_Emotional| | | |
|LONE_Social| | | |

# PART II: Project Scope
## Introduction:
### **Athletes**, ***how does your training regime influence your mental wellbeing?***

Using the information collected from this survey, we plan to explore how being an athlete can influence other aspects of your life. We are interested in how training hours and mindset towards your sport can alter your mental health. Mainly, we want to know to what extent being an athlete can positively or negatively impact your daily wellbeing.

The target audience for the visualizations we are going to create is other athletes seeking self-betterment. The goal is to communicate ways in which they can alter their training and/or mindset to positively influence other aspects of their lives. We hope to spread awareness about how training can impact your mental health both positively and negatively.

## Task Analysis:

#### Task 1)

Compare the range of hours of sleep that athletes get with the range that non-athletes get. Does age group affect these ranges as well?
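As a starting point for this task, the sketch below computes the minimum and maximum hours of sleep per athlete status and age group. The toy values are made up. The column names follow the cleaned data set, and the numeric codes are left uninterpreted because their meaning is not stated here.

```python
import pandas as pd

# Made-up values; codes 1 and 2 mirror the coded categories in the data set.
toy = pd.DataFrame({
    'Athlete/Non-Athlete': [1, 1, 1, 2, 2, 2],
    'Age Group': [1, 1, 2, 1, 2, 2],
    'Hours sleep': [6.5, 8.0, 7.0, 5.0, 9.0, 7.5],
})

# Range of sleep hours for every (athlete status, age group) combination.
sleep_ranges = (
    toy.groupby(['Athlete/Non-Athlete', 'Age Group'])['Hours sleep']
       .agg(['min', 'max'])
)
print(sleep_ranges)
```

On the real data, where both grouping columns are of type `category`, passing `observed=True` to `groupby` keeps only the combinations that actually occur.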

#### Task 2)

Does negative affectivity correlate with total weekly training hours, hours of sleep, or the age group of the individual?
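One way to approach this task is a rank (Spearman) correlation, which suits ordinal codes such as age group. The sketch below is an assumption-laden illustration: the values are made up, and the column names `Negative affectivity` and `Weekly training hours` are hypothetical stand-ins for whatever the data set actually calls them.

```python
import pandas as pd

# Made-up values; 'Negative affectivity' and 'Weekly training hours'
# are hypothetical column names used only for this illustration.
toy = pd.DataFrame({
    'Negative affectivity': [5, 4, 3, 2, 1],
    'Weekly training hours': [2, 4, 6, 8, 10],
    'Hours sleep': [5.0, 6.0, 6.5, 8.0, 9.0],
    'Age Group': [1, 3, 2, 5, 4],
})

# Spearman correlation is computed on the ranks of each column.
rank_corr = toy.corr(method='spearman')

# Keep only the correlations of negative affectivity with the other columns.
na_corr = rank_corr['Negative affectivity'].drop('Negative affectivity')
print(na_corr)
```

Columns stored as `category` would need to be converted to numeric codes before calling `corr`.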

#### Task 3)

Are athletes who spend more time training/competing/playing better at managing other responsibilities?

#### Task 4)

How does psychological wellbeing relate to confidence for athletes and non-athletes?

#### Task 5)

Do athletes or non-athletes cope better with challenging experiences? We will look at negative affectivity, emotional wellbeing, and related measures to better understand the ratings of challenging experiences.