# ___Project - Milestone 2___

###  <span style="color: gray;">Jade Chen, Sam Thorne, Dia Zavery</span> 

#### Dataset Background Information

The athlete/non-athlete survey data can be found on [figshare.com](https://figshare.com/articles/dataset/Athlete_Non-Athlete_MH_Survey_-_ALL_DATA_csv/13035050).

The data were collected from a mental health survey of 753 individuals. The dataset contains demographic information, general health and lifestyle information, athlete information, mental health information, and answers to mental-health-related questions. The study was completed in early 2020 and asked how individuals' mental health was coping in the early stages of the COVID-19 pandemic.

### Set Up

In [1]:
import pandas as pd
import numpy as np
import altair as alt
from IPython.display import Image

# Suppress FutureWarning
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

### Read in Data

First we detect the file's character encoding, then we read in the data with a suitable encoding.

In [2]:
# Check the character encoding (chardet only samples the first 10,000 bytes here)
import chardet

with open('data/Athlete_Non-Athlete.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
result

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

In [3]:
# Reading with 'ascii' raised an error, so we use the more permissive 'ISO-8859-1' encoding instead
data = pd.read_csv('data/Athlete_Non-Athlete.csv', encoding='ISO-8859-1')
data.head(5)

| | Respondent ID | Gender: | Age Group: | Country During Lockdown | Mental Health Condition? | Occupation: | Marital Status: | Smoking Status: | Five Fruit and Veg | Hours sleep: | ... | LONE_ TOTAL | LONE_ Emotional | LONE_ Social | I experience a general sense of emptiness | I miss having people around | There are many people I can trust completely\* | I often feel rejected | There are enough people I feel close to\* | There are plenty of people I can rely on when I have problems\* | Unnamed: 84 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11785667914 | 2 | 2 | 2 | 999 | Unemployed | 1 | 1 | 2 | 6.5 | ... | 3.67 | 4.33 | 3.0 | 4 | 5 | 4 | 4 | 2 | 3 | |
| 1 | 11785634332 | 2 | 3 | 1 | 3, 5, 6 | Administrator | 2 | 1 | 1 | 7.0 | ... | 4.33 | 4.0 | 4.67 | 5 | 2 | 4 | 5 | 5 | 5 | |
| 2 | 11784520014 | 2 | 3 | 2 | 999 | Finance | 1 | 3 | 2 | 4.0 | ... | 3.5 | 4.33 | 2.67 | 4 | 4 | 4 | 5 | 2 | 2 | |
| 3 | 11783867710 | 2 | 1 | 2 | 2 | Unemployed | 1 | 1 | 2 | 8.0 | ... | 2.67 | 3.0 | 2.33 | 4 | 3 | 2 | 2 | 1 | 4 | |
| 4 | 11783726076 | 2 | 1 | 2 | 999 | Student | 1 | 1 | 1 | 8.5 | ... | 4.33 | 5.0 | 3.67 | 5 | 5 | 5 | 5 | 4 | 2 | |
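
One plausible reason for the `ascii` error is that `chardet` was only given the first 10,000 bytes, so a non-ASCII byte further into the file would go unnoticed. The sketch below illustrates that situation with made-up bytes (not the survey file) and only the standard library; `can_decode` is a hypothetical helper introduced for this illustration.

```python
# Made-up payload: 10,000 ASCII bytes followed by a word with one Latin-1 byte (0xE9).
raw = b"a" * 10_000 + b"caf\xe9"
sample = raw[:10_000]  # what a 10,000-byte sample would see


def can_decode(payload: bytes, encoding: str) -> bool:
    """Return True if the payload decodes cleanly with the given encoding."""
    try:
        payload.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(can_decode(sample, "ascii"))     # the sample alone looks like ASCII
print(can_decode(raw, "ascii"))        # the full payload does not
print(can_decode(raw, "ISO-8859-1"))   # Latin-1 maps every byte to a character
```

Because ISO-8859-1 assigns a character to every byte value, decoding with it cannot fail, but it also will not flag a file that is really stored in another encoding such as UTF-8.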


### Data Wrangling
First, we remove the trailing colons (`:`) and question marks (`?`) from the column names.

In [4]:
data.columns = data.columns.str.replace(r'[?:]$', '', regex=True)
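
As a quick check of what the pattern `[?:]$` does, here is a small sketch using Python's `re` module on a few of the header names shown earlier. The last name, with a colon in the middle, is a made-up example to show that only a trailing character is removed.

```python
import re

# A few raw column names as they appear in the CSV header.
raw_columns = ["Gender:", "Mental Health Condition?", "Hours sleep:", "Five Fruit and Veg"]

# '[?:]$' matches a single '?' or ':' only at the very end of the string.
cleaned = [re.sub(r"[?:]$", "", name) for name in raw_columns]
print(cleaned)

# Hypothetical name with a colon in the middle: it is left untouched.
unchanged = re.sub(r"[?:]$", "", "Ratio: a to b")
print(unchanged)
```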

### Data Cleaning
Drop the last column (`Unnamed: 84`), which contains no values.

In [5]:
data = data.drop(data.columns[84], axis=1)
data.head(5)

| | Respondent ID | Gender | Age Group | Country During Lockdown | Mental Health Condition | Occupation | Marital Status | Smoking Status | Five Fruit and Veg | Hours sleep | ... | I tend to take a long time to get over setbacks in my life\* | LONE_ TOTAL | LONE_ Emotional | LONE_ Social | I experience a general sense of emptiness | I miss having people around | There are many people I can trust completely\* | I often feel rejected | There are enough people I feel close to\* | There are plenty of people I can rely on when I have problems\* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11785667914 | 2 | 2 | 2 | 999 | Unemployed | 1 | 1 | 2 | 6.5 | ... | 2 | 3.67 | 4.33 | 3.0 | 4 | 5 | 4 | 4 | 2 | 3 |
| 1 | 11785634332 | 2 | 3 | 1 | 3, 5, 6 | Administrator | 2 | 1 | 1 | 7.0 | ... | 1 | 4.33 | 4.0 | 4.67 | 5 | 2 | 4 | 5 | 5 | 5 |
| 2 | 11784520014 | 2 | 3 | 2 | 999 | Finance | 1 | 3 | 2 | 4.0 | ... | 3 | 3.5 | 4.33 | 2.67 | 4 | 4 | 4 | 5 | 2 | 2 |
| 3 | 11783867710 | 2 | 1 | 2 | 2 | Unemployed | 1 | 1 | 2 | 8.0 | ... | 4 | 2.67 | 3.0 | 2.33 | 4 | 3 | 2 | 2 | 1 | 4 |
| 4 | 11783726076 | 2 | 1 | 2 | 999 | Student | 1 | 1 | 1 | 8.5 | ... | 5 | 4.33 | 5.0 | 3.67 | 5 | 5 | 5 | 5 | 4 | 2 |
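
Dropping by position works as long as the empty column stays at index 84. A possible alternative, sketched here on a toy frame rather than the survey data, is to drop every column that is entirely empty with `dropna(axis=1, how='all')`. The column names and values below are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data; 'Unnamed: 2' is entirely empty.
toy = pd.DataFrame({
    "Gender": [2, 2, 1],
    "Hours sleep": [6.5, np.nan, 8.0],      # partly missing, must be kept
    "Unnamed: 2": [np.nan, np.nan, np.nan],
})

# how='all' drops a column only when every value in it is missing.
trimmed = toy.dropna(axis=1, how="all")
print(list(trimmed.columns))
```

This does not depend on the column's position, but it would also remove any other fully empty column, which may or may not be wanted.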


Drop columns that are not relevant to our project.

In [6]:
columns_to_drop = [
    'Respondent ID',
    'Shielded',
    'Dates Shielding',
    'I consider myself an athlete',
    'I have many goals related to sport',
    'most of my friends are athletes',
    'Exclusivity',
    'Sport is the most important part of my life',
    'I spend more time thinking about sport than anything else',
    'I feel bad about myself when I do badly in sport',
    'I would be very depressed if I were injured and could not compete in sport',
    'Individual/Team athlete',
    'That you had something important to contribute to society',
    'That you belonged to a community (like a social group or your neighbourhood)',
    'That our society is becoming a better place for people like you',
    'That people are basically good',
    'That the way our society works makes sense to you',
    'That you liked most parts of your personality',
    'Good at managing the responsibilities of your daily life',
    'That you had warm and trusting relationships with others',
    'That you had experiences that challenged you to grow and become a better person',
    'Confident to think or express your own ideas and opinions',
    'That your life has a sense of direction or meaning to it',
    "I feel tense or 'wound up'",
    'I still enjoy the things I used to enjoy',
    'I get a sort of frightened feeling as if something awful is about to happen',
    'I can laugh and see the funny side of things',
    'Worrying thoughts go through my mind',
    'I feel cheerful',
    'I can sit at ease and feel relaxed',
    'I feel as if I am slowed down',
    "I get a sort of frightened feeling like 'butterflies' in my stomach",
    'I have lost interest in my appearance',
    'I feel restless as I have to be on the move',
    'I look forward with enjoyment to things',
    'I get sudden feelings of panic',
    'I can enjoy a good book or radio or TV programme',
    'I tend to bounce back quickly after hard times',
    'I have a hard time making it through stressful events*',
    'It does not take me long to recover from a stressful event',
    'It is hard for me to snap back when something bad happens*',
    'I usually come through difficult times with little trouble',
    'I tend to take a long time to get over setbacks in my life*',
    'I experience a general sense of emptiness',
    'I miss having people around',
    'There are many people I can trust completely*',
    'I often feel rejected',
    'There are enough people I feel close to*',
    'There are plenty of people I can rely on when I have problems*',
    'What sport do you play',
]

data.drop(columns=columns_to_drop, inplace=True)
data.head(5)

| | Gender | Age Group | Country During Lockdown | Mental Health Condition | Occupation | Marital Status | Smoking Status | Five Fruit and Veg | Hours sleep | Survey Date | ... | Satisfied | Social Wellbeing | Psychological Wellbeing | HADS OVERALL | HADS-A AVERAGE | HADS-D AVERAGE | RES_TOTAL | LONE_ TOTAL | LONE_ Emotional | LONE_ Social |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 2 | 999 | Unemployed | 1 | 1 | 2 | 6.5 | 13/07/2020 | ... | 4 | 22.0 | 25.0 | 23.0 | 12.0 | 11.0 | 11.0 | 3.67 | 4.33 | 3.0 |
| 1 | 2 | 3 | 1 | 3, 5, 6 | Administrator | 2 | 1 | 1 | 7.0 | 13/07/2020 | ... | 2 | 1.0 | 2.0 | 36.0 | 20.0 | 16.0 | 6.0 | 4.33 | 4.0 | 4.67 |
| 2 | 2 | 3 | 2 | 999 | Finance | 1 | 3 | 2 | 4.0 | 12/07/2020 | ... | 2 | 11.0 | 8.0 | 23.0 | 13.0 | 10.0 | 19.0 | 3.5 | 4.33 | 2.67 |
| 3 | 2 | 1 | 2 | 2 | Unemployed | 1 | 1 | 2 | 8.0 | 12/07/2020 | ... | 4 | 14.0 | 24.0 | 14.0 | 7.0 | 7.0 | 24.0 | 2.67 | 3.0 | 2.33 |
| 4 | 2 | 1 | 2 | 999 | Student | 1 | 1 | 1 | 8.5 | 12/07/2020 | ... | 3 | 2.0 | 17.0 | 29.0 | 15.0 | 14.0 | 17.0 | 4.33 | 5.0 | 3.67 |


### Data Wrangling
We then wrangle the data a second time: each column is cast to an appropriate data type, which makes it easier to compute cardinality.

Lastly, we replace cells holding the value `999` with `NaN`, because we assume that `999` encodes 'prefer not to answer'.

In [7]:
categorical = ['Gender', 'Age Group', 'Country During Lockdown', 'Mental Health Condition', 'Occupation', 'Marital Status', 'Smoking Status', 'Five Fruit and Veg', 'Athlete/Non-Athlete']

for column_name in categorical:
    data[column_name] = data[column_name].astype('category')

temporal = ['Survey Date']

for column_name in temporal:
    # Dates are written day-first (e.g. 13/07/2020), so state that explicitly
    data[column_name] = pd.to_datetime(data[column_name], dayfirst=True)
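
The `Survey Date` values are written day-first (13/07/2020 cannot be read any other way), so a value such as 12/07/2020 is ambiguous unless the order is stated. A small standard-library sketch of how much the reading can differ:

```python
from datetime import datetime

ambiguous = "12/07/2020"  # a value from the 'Survey Date' column

day_first = datetime.strptime(ambiguous, "%d/%m/%Y")
month_first = datetime.strptime(ambiguous, "%m/%d/%Y")

print(day_first.date())    # 2020-07-12
print(month_first.date())  # 2020-12-07
```

With pandas, the same intent can be stated by passing `dayfirst=True` or an explicit `format='%d/%m/%Y'` to `pd.to_datetime`.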

In [8]:
data.replace(999, np.nan, inplace=True)
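
One thing to watch for: `replace(999, ...)` only matches the number 999. A column such as `Mental Health Condition` also holds text like `"3, 5, 6"`, so its sentinel may be stored as the string `"999"` and slip through. Whether that is the case in the real file is an assumption. The toy frame below (made-up values) sketches the difference.

```python
import numpy as np
import pandas as pd

# Toy data: 'condition' is text because some answers list several codes.
toy = pd.DataFrame({
    "condition": ["999", "3, 5, 6", "2", "999"],
    "sleep": [6.5, 999, 8.0, 7.0],
})

# Replacing the number 999 leaves the string "999" untouched.
numeric_only = toy.replace(999, np.nan)

# A second pass for the string form catches the text column as well.
with_strings = numeric_only.replace("999", np.nan)

print(numeric_only.isna().sum().to_dict())
print(with_strings.isna().sum().to_dict())
```

If the string form does occur, doing both replacements before casting columns to `category` keeps `999` from surviving as a category level.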

## Data Attribute Information: Type, Cardinality, and Missing Values
The following dataframe gives the name, data type, number of unique values, range (for quantitative features only), and number of missing values of each attribute.

In [9]:
pd.set_option('display.max_rows', None)
# Function to determine if a column is quantitative
def is_quantitative(column):
    return pd.api.types.is_numeric_dtype(column)

quant_range = []

# Loop through the columns and summarize each attribute
for column_name in data.columns:
    column_data = data[column_name]
    # Only quantitative (numeric) columns get a min-max range
    if is_quantitative(column_data):
        data_range = f'{column_data.min()} - {column_data.max()}'
    else:
        data_range = 'N/A'

    quant_range.append({'Column': column_name,
                        'Data Type': column_data.dtype,
                        'Unique Values': column_data.nunique(),
                        'Range': data_range,
                        'Missing Values': column_data.isna().sum()})

# Convert the list of dictionaries to a DataFrame
quant_range_df = pd.DataFrame(quant_range)

# Display the DataFrame with data types and ranges
quant_range_df

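| | Column | Data Type | Unique Values | Range | Missing Values |
|---|---|---|---|---|---|
| 0 | Gender | category | 2 | N/A | 0 |
| 1 | Age Group | category | 7 | N/A | 0 |
| 2 | Country During Lockdown | category | 7 | N/A | 0 |
| 3 | Mental Health Condition | category | 12 | N/A | 0 |
| 4 | Occupation | category | 405 | N/A | 3 |
| 5 | Marital Status | category | 5 | N/A | 0 |
| 6 | Smoking Status | category | 7 | N/A | 0 |
| 7 | Five Fruit and Veg | category | 2 | N/A | 0 |
| 8 | Hours sleep | float64 | 18 | 1.5 - 10.5 | 0 |
| 9 | Survey Date | datetime64[ns] | 21 | N/A | 0 |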
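
The summary reports no range for `Survey Date`, because `is_numeric_dtype` is false for datetime columns. If a date range is wanted, one option is a helper that also checks `pd.api.types.is_datetime64_any_dtype`. The sketch below uses a toy frame, and `describe_range` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd
from pandas.api import types as ptypes

# Toy frame with one numeric, one datetime and one text column.
toy = pd.DataFrame({
    "Hours sleep": [6.5, 7.0, 4.0],
    "Survey Date": pd.to_datetime(["13/07/2020", "13/07/2020", "12/07/2020"],
                                  format="%d/%m/%Y"),
    "Occupation": ["Unemployed", "Administrator", "Finance"],
})


def describe_range(column: pd.Series) -> str:
    """Return 'min - max' for numeric and datetime columns, else 'N/A'."""
    if ptypes.is_datetime64_any_dtype(column):
        return f"{column.min().date()} - {column.max().date()}"
    if ptypes.is_numeric_dtype(column):
        return f"{column.min()} - {column.max()}"
    return "N/A"


ranges = {name: describe_range(toy[name]) for name in toy.columns}
print(ranges)
```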


## Beginning visualization

This visualization is going to be based on COVID.

I was thinking the task could be to see how athletes' versus non-athletes' social well-being relates to the number of people in their lockdown bubble.

> Social wellbeing is not working too well...
> There are extremely high and extremely low values, but no middle values, and I am not sure what they mean. There are also many VERY high numbers, which suggests they aren't just a few outliers.
> *I am going to pick a different x-axis value that has to do with mental health.*

I was thinking weeks distancing could also be included as a slider or as the size of the point.
- If it is a slider, I am going to use color to encode athlete versus non-athlete.
- If weeks is encoded as size, I am going to use a dropdown for athlete and non-athlete.

In [10]:
social_covid = alt.Chart(data).mark_point().encode(
    x = alt.X("LONE_ Social"),
    size = alt.Size("# in lockdown bubble"),
    color = alt.Color("Athlete/Non-Athlete")
)
social_covid

What if the question were instead: depending on the weeks spent distancing, how do the LONE social and emotional scores of athletes and non-athletes fare based on the number of people in their lockdown bubble?

In [11]:
dropdown = alt.binding_select(
    options=['LONE_ TOTAL', 'LONE_ Social', 'LONE_ Emotional'],
    name='X-axis column: '
)

xcol_param = alt.param(
    value='LONE_ TOTAL',
    bind=dropdown
)

selection = alt.selection_point(fields = ['Athlete/Non-Athlete'])
color = alt.condition(
    selection,
    alt.Color('Athlete/Non-Athlete:N').legend(None),
    alt.value('lightgray')
)

LONE_covid = alt.Chart(data).mark_point().encode(
    y = alt.Y("# in lockdown bubble:O"),
    x = alt.X('x:Q').title(''),
    size = alt.Size('count()').legend(None), # not sure if I want to include this at all
    color = color,
).transform_calculate(
    x = f'datum[{xcol_param.name}]'
).add_params(
    xcol_param
)

legend = alt.Chart(data).mark_point().encode(
    y = alt.Y('Athlete/Non-Athlete:N').axis(orient = 'right'),
    color = color
).add_params(
    selection
)

LONE_covid | legend

In [97]:
# READING IN DATA V2:
data2 = pd.read_csv('data/my_data(v2).csv')
data2.head()

data3 = data2.rename(columns = {'Athlete/Non-Athlete' : 'is_athlete'})


In [104]:
# Selects the attributes in the dropdown for x channel
dropdown = alt.binding_select(
    options=['LONE_ TOTAL', 'LONE_ Social', 'LONE_ Emotional'],
    name='X-axis: '
)

# makes a parameter for x-axis dropdown
xcol_param = alt.param(
    value='LONE_ TOTAL',
    bind=dropdown
)

# generates a slider, with range and steps
slider = alt.binding_range(min=0, max=7, step=1, name='Minimum Weeks Spent Social Distancing: ')
weeks = alt.param(value = 0, bind=slider)

selection = alt.selection_point(fields = ['Athlete/Non-Athlete'])
color = alt.condition(
    selection,
    alt.Color('Athlete/Non-Athlete:N').legend(None),
    alt.value('lightgray')
)

LONE_covid = alt.Chart(data).mark_point(filled = True, opacity = 0.5).encode(
    y = alt.Y("# in lockdown bubble:O", 
              title = ["Number of other people", "in lockdown bubble"],
              sort = 'descending'),
    x = alt.X('x:Q').title(''),
    # not sure if I want to include this at all
    size = alt.Size('count()'), 
    color = color
).transform_calculate(
    x = f'datum[{xcol_param.name}]'
).add_params(
    xcol_param,
    weeks
).transform_filter(
    # alt.FieldEqualPredicate(field='Weeks Social Distancing', equal=weeks)
    alt.FieldGTEPredicate(field='Weeks Social Distancing', gte=weeks)
).properties(
    title = { 
        # TODO: FIX ALIGNMENT OF THE TITLE
        "text": ["Number of people in lockdown bubble as a function of varying LONE scores",
                 "for athletes and non-athletes"], 
        "subtitle": ["How do LONE scores change as weeks spent social distancing accumulate?",
                     "Is there an accumulation of athletes relative to non-athletes in the LONE distribution?",
                     "Does the number of people in the lockdown bubble shift the LONE scores?"]
    }
)

legend = alt.Chart(data).mark_point(filled = True).encode(
    y = alt.Y('Athlete/Non-Athlete:N').axis(orient = 'right'),
    color = color
).add_params(
    selection
)

LONE_covid | legend

# change shape instead of color.
# change the title of the slider.
# fix the axis for the slider changing
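
To reason about what the slider and the `count()` size encoding do together, it can help to restate them in pandas: keep the rows whose weeks are at least the slider value, then count rows per plotted position. The sketch below assumes the chart aggregates over the x field, the y field and the athlete group. The toy values are made up and `points_for` is a hypothetical helper.

```python
import pandas as pd

# Made-up rows using the column names from the chart.
toy = pd.DataFrame({
    "Weeks Social Distancing": [0, 2, 5, 7, 7],
    "# in lockdown bubble": [1, 1, 3, 3, 3],
    "LONE_ TOTAL": [3.5, 3.5, 2.67, 4.33, 4.33],
    "Athlete/Non-Athlete": ["Athlete", "Athlete", "Non-Athlete", "Athlete", "Athlete"],
})


def points_for(min_weeks: int) -> pd.DataFrame:
    """Rows kept by a 'weeks >= min_weeks' filter, counted per plotted point."""
    kept = toy[toy["Weeks Social Distancing"] >= min_weeks]
    keys = ["# in lockdown bubble", "LONE_ TOTAL", "Athlete/Non-Athlete"]
    return kept.groupby(keys).size().reset_index(name="count")


print(points_for(0))  # all rows contribute
print(points_for(7))  # only respondents with at least 7 weeks remain
```

Moving the slider to the right can only remove rows, so point sizes shrink or points disappear; they never grow.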

In [117]:
# generates a slider, with range and steps
slider = alt.binding_range(min=0, max=7, step=1, name='Minimum Weeks Spent Social Distancing: ')
weeks = alt.param(value = 0, bind=slider)

# generates a drop down for athlete and non-athlete
input_dropdown = alt.binding_select(options=[None, 'Athlete','Non-Athlete'],
                                    labels = ['Both', 'Athlete','Non-Athlete'],
                                    name = 'Athlete type: ')
selection = alt.selection_point(fields=['is_athlete'],
                                bind=input_dropdown)

# customizing colors of the variables
domain = ['Athlete', 'Non-Athlete']
range_ = ['#2459ed', '#eda724']

violin_lone = alt.Chart(data3).add_params(
    selection
).transform_filter(
    selection
).transform_filter(
    alt.FieldGTEPredicate(field='Weeks Social Distancing', gte=weeks)
).transform_density(
    'LONE_ TOTAL',
    as_=['LONE_ TOTAL', 'density'],
    extent=[0, 5],
    groupby=['is_athlete', '# in lockdown bubble']
).mark_area(orient='horizontal', opacity=0.6).encode(
    y=alt.Y('LONE_ TOTAL:Q', title="LONE total score"),
    color = alt.Color('is_athlete:N').scale(domain = domain, range = range_),
    x=alt.X(
        'density:Q',
        stack='center',
        impute=None,
        title=None,
        axis=alt.Axis(labels=False, values=[0],grid=False, ticks=False),
    ),
    column=alt.Column(
        '# in lockdown bubble:O',
        header=alt.Header(
            titleOrient='bottom',
            labelOrient='bottom',
            labelPadding=0,
        ),
    )
).properties(
    width=100
).configure_facet(
    spacing=0
).configure_view(
    stroke=None
).add_params(
    weeks
).configure_axis(
    grid = False
)

violin_lone
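
`transform_density` replaces the raw scores in each group with a smoothed density curve, which is what gives the violins their shape. The NumPy sketch below shows the general idea of a Gaussian kernel density estimate on the same `[0, 5]` extent. The scores, `bandwidth` and `steps` are made-up values, and `gaussian_kde_on_grid` is a hypothetical helper. The chart picks its own bandwidth, so these numbers will not match it.

```python
import numpy as np


def gaussian_kde_on_grid(values, extent=(0.0, 5.0), steps=50, bandwidth=0.5):
    """Evaluate a Gaussian kernel density estimate on an evenly spaced grid."""
    values = np.asarray(values, dtype=float)
    grid = np.linspace(extent[0], extent[1], steps)
    # One Gaussian bump per observation, evaluated at every grid point.
    z = (grid[:, None] - values[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z ** 2).sum(axis=1)
    density = kernel_sum / (len(values) * bandwidth * np.sqrt(2 * np.pi))
    return grid, density


# Made-up LONE totals, grouped the way the chart groups by 'is_athlete'.
scores = {"Athlete": [2.67, 3.0, 3.5], "Non-Athlete": [3.67, 4.33, 4.33]}
curves = {group: gaussian_kde_on_grid(vals) for group, vals in scores.items()}

for group, (grid, density) in curves.items():
    print(group, "peaks near", round(float(grid[density.argmax()]), 2))
```

A larger bandwidth gives smoother, wider violins; a smaller one follows individual scores more closely.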