# ___Project - Milestone 1___

###  <span style="color: gray;">Jade Chen, Sam Thorne, Dia Zavery</span> 

#### Dataset Background Information

The athlete/non-athlete survey data can be found on [figshare.com](https://figshare.com/articles/dataset/Athlete_Non-Athlete_MH_Survey_-_ALL_DATA_csv/13035050).

The data were collected from a mental health survey of 753 individuals. The dataset contains demographic information, general health and lifestyle information, athlete information, mental health information, and answers to mental health related questions. The study was completed in early 2020 and examined how individuals' mental health was coping in the early stages of the COVID-19 pandemic.

# PART I: Initial Exploration

### Set Up

In [1]:
import pandas as pd
import numpy as np
import altair as alt

# Suppress FutureWarning messages only (other warnings stay visible)
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

### Read in Data

First we check the file's character encoding, then we read in the data with an appropriate encoding.

In [2]:
# Detect the character encoding from the first 10,000 bytes of the file
import chardet

with open('data/Athlete_Non-Athlete.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
result

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

In [3]:
# Reading with the detected 'ascii' encoding raised an error, so we use the more permissive 'ISO-8859-1' instead
data = pd.read_csv('data/Athlete_Non-Athlete.csv', encoding='ISO-8859-1')
data.head(5)

Unnamed: 0,Respondent ID,Gender:,Age Group:,Country During Lockdown,Mental Health Condition?,Occupation:,Marital Status:,Smoking Status:,Five Fruit and Veg,Hours sleep:,...,LONE_ TOTAL,LONE_ Emotional,LONE_ Social,I experience a general sense of emptiness,I miss having people around,There are many people I can trust completely*,I often feel rejected,There are enough people I feel close to*,There are plenty of people I can rely on when I have problems*,Unnamed: 84
0,11785667914,2,2,2,999,Unemployed,1,1,2,6.5,...,3.67,4.33,3.0,4,5,4,4,2,3,
1,11785634332,2,3,1,"3, 5, 6",Administrator,2,1,1,7.0,...,4.33,4.0,4.67,5,2,4,5,5,5,
2,11784520014,2,3,2,999,Finance,1,3,2,4.0,...,3.5,4.33,2.67,4,4,4,5,2,2,
3,11783867710,2,1,2,2,Unemployed,1,1,2,8.0,...,2.67,3.0,2.33,4,3,2,2,1,4,
4,11783726076,2,1,2,999,Student,1,1,1,8.5,...,4.33,5.0,3.67,5,5,5,5,4,2,
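
One possible explanation for the mismatch between the detected encoding and the read error (an assumption, since the notebook does not show the error message) is that `chardet` only saw the first 10,000 bytes, so a non-ASCII byte later in the file went unnoticed. The sketch below reproduces that situation with toy bytes rather than the real CSV file:

```python
# Toy bytes standing in for the CSV file: a long ASCII prefix followed by
# one Latin-1 encoded character (0xE9 is "é" in ISO-8859-1).
raw = b"a" * 10_000 + "café".encode("ISO-8859-1")

sample = raw[:10_000]        # what rawdata.read(10000) would return
sample.decode("ascii")       # succeeds: the sample is pure ASCII

try:
    raw.decode("ascii")      # fails: the non-ASCII byte lies beyond the sample
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# ISO-8859-1 maps every possible byte to a character, so decoding cannot fail.
text = raw.decode("ISO-8859-1")
```

Because ISO-8859-1 never raises a decode error, it is a safe fallback, but it will silently produce wrong characters if the file is actually in another encoding such as UTF-8.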


### Data Cleaning
Drop the last column (`Unnamed: 84`), which is empty.

In [4]:
data = data.drop(data.columns[84], axis=1)
data.head(5)

Unnamed: 0,Respondent ID,Gender:,Age Group:,Country During Lockdown,Mental Health Condition?,Occupation:,Marital Status:,Smoking Status:,Five Fruit and Veg,Hours sleep:,...,I tend to take a long time to get over setbacks in my life*,LONE_ TOTAL,LONE_ Emotional,LONE_ Social,I experience a general sense of emptiness,I miss having people around,There are many people I can trust completely*,I often feel rejected,There are enough people I feel close to*,There are plenty of people I can rely on when I have problems*
0,11785667914,2,2,2,999,Unemployed,1,1,2,6.5,...,2,3.67,4.33,3.0,4,5,4,4,2,3
1,11785634332,2,3,1,"3, 5, 6",Administrator,2,1,1,7.0,...,1,4.33,4.0,4.67,5,2,4,5,5,5
2,11784520014,2,3,2,999,Finance,1,3,2,4.0,...,3,3.5,4.33,2.67,4,4,4,5,2,2
3,11783867710,2,1,2,2,Unemployed,1,1,2,8.0,...,4,2.67,3.0,2.33,4,3,2,2,1,4
4,11783726076,2,1,2,999,Student,1,1,1,8.5,...,5,4.33,5.0,3.67,5,5,5,5,4,2


### Data Wrangling
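First, we remove the trailing colons (`:`) and question marks (`?`) from the column names.

Second, we convert the columns to appropriate data types, which makes it easier to find their cardinality.

Lastly, we change cells with the value `999` to `NaN`, because we assume that it means 'prefer not to answer'. Note that this only replaces the numeric value `999`: the text value `999` in `Mental Health Condition` and totals built from several `999` answers (for example, the maximum of 13986 in `MHC-SF OVERALL`) are left unchanged, as the outputs further down show.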

In [5]:
data.columns = data.columns.str.replace(r'[?:]$', '', regex=True)

In [6]:
categorical = ['Respondent ID', 'Gender', 'Age Group', 'Country During Lockdown', 'Mental Health Condition', 'Occupation', 'Marital Status', 'Smoking Status', 'Five Fruit and Veg', 'Shielded', '# in lockdown bubble', 'Athlete/Non-Athlete', 'What sport do you play', 'Individual/Team athlete']

for column_name in categorical:
    data[column_name] = data[column_name].astype('category')

temporal = ['Survey Date']

for column_name in temporal:
    data[column_name] = pd.to_datetime(data[column_name])

In [7]:
data.replace(999, np.nan, inplace=True)
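
The outputs later in this notebook suggest that some `999` codes survive this step: `Mental Health Condition` still reports 98 rows of `999`, and totals such as `MHC-SF OVERALL` reach 13986, which equals 14 × 999. A minimal sketch of one way to handle both cases is shown below. It uses a toy frame with made-up values, and the upper bound of 70 is an assumption based on 14 items scored 0 to 5:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; column names are borrowed from the survey.
toy = pd.DataFrame({
    "Mental Health Condition": ["2", "999", "3, 6"],  # stored as text
    "Happy": [4, 999, 3],                             # single numeric item
    "MHC-SF OVERALL": [48, 13986, 36],                # total over 14 items
})

# The integer 999 does not match the string "999", and 13986 is not 999 either.
naive = toy.replace(999, np.nan)

# Per-column sentinels: text "999" for the text column, numeric 999 for the item.
cleaned = toy.replace({"Mental Health Condition": "999", "Happy": 999}, np.nan)

# Assumption: a valid MHC-SF total cannot exceed 14 items * 5 points = 70.
MAX_VALID_TOTAL = 70
cleaned["MHC-SF OVERALL"] = cleaned["MHC-SF OVERALL"].where(
    cleaned["MHC-SF OVERALL"] <= MAX_VALID_TOTAL
)
```

In the real data, the replacement would need to run before the categorical conversion, or be followed by removing the unused `999` category, and each total would need its own valid upper bound.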

## Data Attribute Information: Type, Cardinality, and Missing Values
The following dataframe gives the name, data type, cardinality (unique values for categorical features, and range for quantitative and temporal features), and missing values of each attribute.

In [8]:
pd.set_option('display.max_rows', None)
# Function to determine if a column is quantitative
def is_quantitative(column):
    return pd.api.types.is_numeric_dtype(column)

quant_range = []

# Loop through the columns of the original DataFrame
for column_name in data.columns:
    column_data = data[column_name]
    data_type = column_data.dtype
    if is_quantitative(column_data):
        data_range = f'{column_data.min()} - {column_data.max()}'
        unique_values = column_data.nunique()
    else:
        data_range = 'N/A'
        unique_values = column_data.nunique()
        
    count_na = data[column_name].isna().sum()
    quant_range.append({'Column': column_name, 'Data Type': data_type, 'Unique Values': unique_values, 'Range': data_range, 'Missing Values': count_na})

# Convert the list of dictionaries to a DataFrame
quant_range_df = pd.DataFrame(quant_range)

# Display the DataFrame with data types and ranges
quant_range_df

Unnamed: 0,Column,Data Type,Unique Values,Range,Missing Values
0,Respondent ID,category,753,,0
1,Gender,category,2,,0
2,Age Group,category,7,,0
3,Country During Lockdown,category,7,,0
4,Mental Health Condition,category,12,,0
5,Occupation,category,405,,3
6,Marital Status,category,5,,0
7,Smoking Status,category,7,,0
8,Five Fruit and Veg,category,2,,0
9,Hours sleep,float64,18,1.5 - 10.5,0


In [9]:
#pd.reset_option('display.max_rows')

### Levels of Categorical Variables
Specifically, if a categorical variable has fewer than 10 levels, we print the count of each level.

In [None]:
#Referenced from DSCI 320 Prog4
def cat_cardinality():
    for field in categorical:
        print(field, " ", data[field].nunique())
        if data[field].nunique() < 10:
            print(data[field].value_counts())

cat_cardinality()

### Semantics
TODO

### Insights

TODO

### Potential Challenges

TODO

## Exploratory Data Analysis

### Size of the Dataset
As we can see below, there are 753 rows and 84 columns.

In [12]:
data.shape

(753, 84)

### Numeric Summaries - Frequency Table (Categorical)

For all the categorical features, we have created a frequency table to show the distribution of values.

In [48]:
# Get a list of categorical columns (excluding 'Respondent ID')
categorical_columns = [col for col in data.select_dtypes(include=['category']).columns if col != "Respondent ID"]

# Create frequency tables for each categorical column
frequency_tables = {}
for column in categorical_columns:
    frequency_table = data[column].value_counts()
    frequency_tables[column] = frequency_table

# Display the frequency tables
for column, table in frequency_tables.items():
    print(table)
    print("\n")


Gender
2    400
1    353
Name: count, dtype: int64


Age Group
2    203
3    182
4    159
1     73
5     73
6     48
7     15
Name: count, dtype: int64


Country During Lockdown
1    558
2    186
4      4
3      2
5      1
6      1
7      1
Name: count, dtype: int64


Mental Health Condition
2          578
999         98
6           21
8           21
3, 6        15
3           12
5            2
7            2
3, 5, 6      1
3, 6, 7      1
4            1
6, 6         1
Name: count, dtype: int64


Occupation
Student                                                   83
Teacher                                                   46
Retired                                                   38
Civil servant                                             16
Lecturer                                                  13
Civil Servant                                             12
Accountant                                                10
Manager                                                    9
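
The `Occupation` counts above list `Civil servant` and `Civil Servant` as separate levels, which suggests that part of the 405 unique values comes from inconsistent capitalization. The sketch below shows one way such variants could be collapsed before counting. The values are toy data, not the survey responses:

```python
import pandas as pd

# Toy values (hypothetical) mimicking the case variants seen in the output above.
occupation = pd.Series(
    ["Civil servant", "Civil Servant", " civil servant", "Teacher", "teacher", "Student"]
)

# Trim whitespace and lower-case before counting so the variants collapse together.
normalized = occupation.str.strip().str.lower()
counts = normalized.value_counts()
```

This only handles capitalization and stray whitespace. Free-text answers that describe the same job in different words would still need manual grouping.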


### Numeric Summaries - Frequency Table (Numerical)

For all the numerical features, we have created a frequency table to show the distribution of values.

In [50]:
numeric_columns_all = data.select_dtypes(include=['number']).columns.tolist()

# Create frequency tables for each numeric column
frequency_tables = {}
for column in numeric_columns_all:
    frequency_table = data[column].value_counts()
    frequency_tables[column] = frequency_table

# Display the frequency tables
for column, table in frequency_tables.items():
    print(table)
    print("\n")


Hours sleep
7.0     191
8.0     146
7.5     137
6.5      74
6.0      74
8.5      43
5.0      24
5.5      19
9.0      19
9.5       7
4.5       6
4.0       5
10.0      2
10.5      2
3.5       1
2.0       1
3.0       1
1.5       1
Name: count, dtype: int64


Weeks Social Distancing
5    319
4    174
6    106
7     67
3     49
2     20
0     14
1      4
Name: count, dtype: int64


AIMS_ TOTAL
5.14    33
6.14    32
5.43    31
5.29    26
6.00    26
5.86    26
5.57    25
5.00    25
6.29    24
4.86    23
4.57    20
5.71    18
4.43    17
4.14    16
4.71    15
3.86    14
4.29    13
6.43    12
4.00    11
3.71    10
6.57    10
6.71     9
2.86     9
6.86     8
7.00     8
1.00     7
3.29     6
1.43     5
1.14     4
2.14     4
3.57     4
3.43     3
3.14     3
2.71     3
1.29     3
2.00     3
2.43     3
3.00     2
1.86     2
2.57     2
2.29     1
1.57     1
1.71     1
Name: count, dtype: int64


Social Identity
6.33    66
5.67    65
6.00    56
5.33    46
5.00    44
6.67    39
4.67    38
7.00    28
4.3

### Numeric Summaries - Summary Statistics (Numerical)

Here we have generated summary statistics for all numerical columns.

In [56]:
data[numeric_columns_all].describe()

Unnamed: 0,Hours sleep,Weeks Social Distancing,AIMS_ TOTAL,Social Identity,I consider myself an athlete,I have many goals related to sport,most of my friends are athletes,Exclusivity,Sport is the most important part of my life,I spend more time thinking about sport than anything else,Negative Affectivity,I feel bad about myself when I do badly in sport,I would be very depressed if I were injured and could not compete in sport,Sport level,Total weekly playing hours,Weekly training hours,Weekly competing hours,MHC-SF OVERALL,Emotional Wellbeing,Happy,Interested in life,Satisfied,Social Wellbeing,That you had something important to contribute to society,That you belonged to a community (like a social group or your neighbourhood),That our society is becoming a better place for people like you,That people are basically good,That the way our society works makes sense to you,Psychological Wellbeing,That you liked most parts of your personality,Good at managing the responsibilities of your daily life,That you had warm and trusting relationships with others,That you had experiences that challenged you to grow and become a better person,Confident to think or express your own ideas and opinions,That your life has a sense of direction or meaning to it,HADS OVERALL,HADS-A AVERAGE,HADS-D AVERAGE,I feel tense or 'wound up',I still enjoy the things I used to enjoy,I get a sort of frightened feeling as if something awful is about to happen,I can laugh and see the funny side of things,Worrying thoughts go through my mind,I feel cheerful,I can sit at ease and feel relaxed,I feel as if I am slowed down,I get a sort of frightened feeling like 'butterflies' in my stomach,I have lost interest in my appearance,I feel restless as I have to be on the move,I look forward with enjoyment to things,I get sudden feelings of panic,I can enjoy a good book or radio or TV programme,RES_TOTAL,I tend to bounce back quickly after hard times,I have a hard time making it through stressful events*,It does not take me long to recover from a stressful event,It is hard for me to snap back when something bad happens*,I usually come through difficult times with little trouble,I tend to take a long time to get over setbacks in my life*,LONE_ TOTAL,LONE_ Emotional,LONE_ Social,I experience a general sense of emptiness,I miss having people around,There are many people I can trust completely*,I often feel rejected,There are enough people I feel close to*,There are plenty of people I can rely on when I have problems*
count,753.0,753.0,518.0,518.0,518.0,518.0,518.0,518.0,518.0,518.0,518.0,518.0,518.0,354.0,354.0,354.0,354.0,753.0,753.0,688.0,688.0,688.0,753.0,688.0,688.0,688.0,688.0,688.0,753.0,688.0,688.0,688.0,688.0,688.0,688.0,753.0,753.0,753.0,671.0,671.0,671.0,671.0,671.0,671.0,671.0,671.0,671.0,671.0,671.0,671.0,671.0,671.0,753.0,661.0,661.0,661.0,661.0,661.0,661.0,654.0,654.0,654.0,654.0,654.0,654.0,654.0,654.0,654.0
mean,7.15,4.76,4.96,5.09,5.12,5.53,4.64,4.58,4.72,4.43,5.16,5.06,5.25,1.43,9.22,7.41,1.81,1247.1,268.48,3.59,3.8,3.32,442.93,2.76,2.98,2.04,3.0,2.09,535.69,3.04,3.45,3.81,3.07,3.51,3.13,1534.59,768.48,766.1,1.31,0.65,1.15,0.4,1.27,0.63,1.02,1.22,0.79,1.03,1.41,0.67,0.86,0.54,750.68,3.86,3.31,3.48,3.43,3.33,3.5,2.64,2.93,2.36,2.42,3.76,2.69,2.6,2.12,2.28
std,1.04,1.31,1.31,1.46,1.8,1.5,1.7,1.62,1.7,1.76,1.53,1.74,1.65,0.7,5.75,4.62,2.96,3918.19,839.23,0.97,1.1,1.22,1400.11,1.56,1.57,1.62,1.33,1.49,1678.85,1.39,1.3,1.25,1.39,1.39,1.55,4355.65,2177.41,2178.24,0.79,0.72,0.92,0.63,0.93,0.7,0.79,0.86,0.76,0.95,0.88,0.82,0.86,0.74,1957.44,0.94,1.06,1.03,1.03,0.99,1.01,0.74,0.8,0.92,1.22,1.1,1.2,1.12,0.97,1.06
min,1.5,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.5,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,6.5,4.0,4.33,4.33,5.0,5.0,4.0,3.5,4.0,3.0,4.5,4.0,5.0,1.0,6.0,4.0,0.25,36.0,9.0,3.0,3.0,3.0,9.0,1.0,2.0,1.0,2.0,1.0,17.0,2.0,3.0,3.0,2.0,3.0,2.0,8.0,5.0,3.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,18.0,3.0,3.0,3.0,3.0,3.0,3.0,2.17,2.33,1.67,1.0,3.0,2.0,2.0,1.0,2.0
50%,7.0,5.0,5.14,5.33,6.0,6.0,5.0,5.0,5.0,5.0,5.5,5.0,6.0,1.0,8.0,6.0,1.5,48.0,12.0,4.0,4.0,4.0,14.0,3.0,3.0,2.0,3.0,2.0,22.0,3.0,4.0,4.0,3.0,4.0,3.0,13.0,8.0,5.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,22.0,4.0,3.0,4.0,4.0,3.0,4.0,2.5,3.0,2.33,2.0,4.0,2.0,2.0,2.0,2.0
75%,8.0,5.0,5.86,6.33,6.0,7.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,2.0,11.0,10.0,2.0,56.0,13.0,4.0,5.0,4.0,19.0,4.0,4.0,3.0,4.0,3.0,26.0,4.0,4.0,5.0,4.0,5.0,4.0,21.0,12.0,9.0,2.0,1.0,2.0,1.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,26.0,4.0,4.0,4.0,4.0,4.0,4.0,3.17,3.33,3.0,4.0,5.0,4.0,3.0,3.0,3.0
max,10.5,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,3.0,66.5,25.0,48.0,13986.0,2997.0,5.0,5.0,5.0,4995.0,5.0,5.0,5.0,5.0,5.0,5994.0,5.0,5.0,5.0,5.0,5.0,5.0,13986.0,6993.0,6993.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,5994.0,5.0,5.0,5.0,5.0,5.0,5.0,4.83,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0


### Numeric Summaries - Central Tendency Measures (Numerical)
We have generated the mean, median, and mode for all numeric features.

In [52]:
for column in numeric_columns_all:
    # Calculate Mean, Median, and Mode for each feature
    mean = data[column].mean()
    median = data[column].median()
    mode = data[column].mode().values[0]

    # Print the central tendency measures
    print(f"{column}:")
    print(f"Mean: {mean}")
    print(f"Median: {median}")
    print(f"Mode: {mode}")
    print("\n")


Hours sleep:
Mean: 7.148738379814077
Median: 7.0
Mode: 7.0


Weeks Social Distancing:
Mean: 4.763612217795485
Median: 5.0
Mode: 5


AIMS_ TOTAL:
Mean: 4.963706563706563
Median: 5.14
Mode: 5.14


Social Identity:
Mean: 5.094594594594595
Median: 5.33
Mode: 6.33


I consider myself an athlete:
Mean: 5.121621621621622
Median: 6.0
Mode: 6.0


I have many goals related to sport:
Mean: 5.527027027027027
Median: 6.0
Mode: 6.0


most of my friends are athletes:
Mean: 4.635135135135135
Median: 5.0
Mode: 5.0


Exclusivity:
Mean: 4.575289575289576
Median: 5.0
Mode: 5.0


Sport is the most important part of my life:
Mean: 4.718146718146718
Median: 5.0
Mode: 5.0


I spend more time thinking about sport than anything else:
Mean: 4.4324324324324325
Median: 5.0
Mode: 5.0


Negative Affectivity:
Mean: 5.155405405405405
Median: 5.5
Mode: 6.0


I feel bad about myself when I do badly in sport:
Mean: 5.063706563706564
Median: 5.0
Mode: 5.0


I would be very depressed if I were injured and could not compete

### Visual Summaries - Box Plots (Numerical)
There is a tremendous number of numeric features, so here we focus on the 8 numerical features we think are most important to this project.
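
The plotting cells below loop over a `numeric_columns` list whose definition is not shown in this notebook. The sketch below is only a guess at how such a list could be built: the eight names are hypothetical picks from the summary statistics table, and `select_numeric` is an illustrative helper, not part of the original code.

```python
import pandas as pd

# Hypothetical selection: the notebook does not show which 8 features were used.
candidate_columns = [
    "Hours sleep",
    "Weeks Social Distancing",
    "AIMS_ TOTAL",
    "Negative Affectivity",
    "Total weekly playing hours",
    "Weekly training hours",
    "Emotional Wellbeing",
    "Psychological Wellbeing",
]

def select_numeric(df, candidates):
    """Keep only the candidates that exist in df and have a numeric dtype."""
    return [
        c for c in candidates
        if c in df.columns and pd.api.types.is_numeric_dtype(df[c])
    ]

# Toy frame so the sketch runs on its own; in the notebook, pass `data` instead.
toy = pd.DataFrame({"Hours sleep": [7.0, 8.5], "Occupation": ["Student", "Teacher"]})
numeric_columns = select_numeric(toy, candidate_columns)
```

Filtering against the frame's columns avoids a `KeyError` or an empty chart if a name is misspelled or a column was converted to a non-numeric type.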

In [42]:
box_plots = []

for column in numeric_columns:
    box_plot = alt.Chart(data).mark_boxplot().encode(
        y=alt.Y(f'{column}:Q', title=column)
    ).properties(
        title=f'{column}',
        width = 60
    )
    box_plots.append(box_plot)

# Concatenate the individual box plots into a single row
combined_box_plots = alt.hconcat(*box_plots)

combined_box_plots

### Visual Summaries - Histograms (Numerical)
There is a tremendous number of numeric features, so again we focus on the 8 numerical features we think are most important to this project.

In [29]:
# Create histograms for each numeric column
histograms = []

for i, column in enumerate(numeric_columns):
    histogram = alt.Chart(data).mark_bar().encode(
        alt.X(column, bin=alt.Bin(maxbins=10), title=column),
        alt.Y('count()', title='Frequency')
    ).properties(
        width=200,
        height=150,
        title=f'{column}'
    )
    
    # Group histograms into rows of 4
    if i % 4 == 0:
        histograms.append([histogram])
    else:
        histograms[-1].append(histogram)

# Create a grid of subplots with 4 histograms per row
grid = []
for row in histograms:
    grid.append(alt.hconcat(*row))

# Combine the rows of subplots into a single Altair chart
histogram_chart = alt.vconcat(*grid)
histogram_chart


### Save Data

Save the wrangled and cleaned data for use in Voyager, which we will use for the following univariate and multivariate visual summaries.

In [25]:
data.to_csv('data/my_data.csv', index=False)

### Univariate Summaries

In [None]:
#TODO - add screenshots and description

### Multivariate Summaries

In [44]:
#TODO - add screenshots and description

# PART II: Project Scope
## Introduction:
### **Athletes**, ***how does your training regime influence your mental wellbeing?***

Using the information collected from this survey, we plan to delve into how being an athlete can influence other aspects of your life. We are interested in picking apart how training hours and mindset towards your sport can alter your mental health. Mainly, we want to know to what extent being an athlete can positively or negatively impact your daily wellbeing.

The target audience for the visualizations we are going to create is other athletes seeking self-betterment. The goal is to communicate ways in which they can alter their training and/or mindset to positively influence other aspects of their lives. We hope to spread awareness about how training can impact your mental health both positively and negatively.

## Task Analysis:

#### Task 1)

Determine the range of hours of sleep athletes get versus the range of hours of sleep non-athletes get. Does age play into these ranges as well?
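
A minimal sketch of how this comparison could be computed with pandas is shown below. The column names come from the dataset, but the rows are toy values and the group codes `1` and `2` are placeholders, since the coding of `Athlete/Non-Athlete` is not documented here:

```python
import pandas as pd

# Toy rows (hypothetical values); codes 1 and 2 are placeholder group labels.
toy = pd.DataFrame({
    "Athlete/Non-Athlete": [1, 1, 1, 2, 2, 2],
    "Age Group": [1, 2, 2, 1, 2, 2],
    "Hours sleep": [6.0, 7.5, 9.0, 5.0, 7.0, 8.0],
})

# Range of sleep hours per group.
sleep_range = toy.groupby("Athlete/Non-Athlete")["Hours sleep"].agg(["min", "max"])

# Same range, additionally split by age group.
sleep_range_by_age = toy.groupby(["Athlete/Non-Athlete", "Age Group"])["Hours sleep"].agg(
    ["min", "max"]
)
```

Adding quartiles to the aggregation would give the same information that a grouped box plot displays.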

#### Task 2)

Does negative affectivity correlate with total weekly training hours, hours of sleep, or the age group of the individual?
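
As a rough sketch of how this task could be approached numerically, the snippet below computes rank correlations on toy values. Spearman correlation is our choice here (an assumption, not something the project has settled on) because the survey items are ordinal ratings:

```python
import pandas as pd

# Toy rows (hypothetical values) using column names from the dataset.
toy = pd.DataFrame({
    "Negative Affectivity": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Weekly training hours": [2.0, 4.0, 6.0, 8.0, 20.0],
    "Hours sleep": [9.0, 8.0, 7.0, 6.0, 5.0],
})

# Pairwise Spearman rank correlations between the three columns.
corr = toy.corr(method="spearman")
```

In this toy data the ratings rise with training hours and fall with sleep, so the two correlations with `Negative Affectivity` are 1 and -1. Real survey data will be far noisier, and `Age Group` is categorical, so it would be better compared through grouped summaries than a correlation coefficient.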

#### Task 3)

Are athletes who spend more time training/competing/playing better at managing other responsibilities?

#### Task 4)

How does psychological wellbeing relate to confidence for athletes and non-athletes?

#### Task 5)

Do athletes or non-athletes cope better with challenging experiences? We will look at negative affectivity, emotional wellbeing, etc. to better understand ratings of challenging experiences.

# PART III: Visualization Ideas

TODO:

Preliminary Sketches

Write out each task and below each task, include the following

1. Three sketches (low fidelity) suited for the task

2. A critique of all three

3. Sketch (high fidelity) of the final one selected

4. How the sketch you selected adheres to theoretical principles you have been exposed to this term.


# PART IV: Next Steps

TODO:
Outline: List out the next 5 things that you plan to do as a group