# ___Project - Milestone 2___

###  <span style="color: gray;">Jade Chen, Sam Thorne, Dia Zavery</span> 

#### Dataset Background Information

The athlete/non-athlete survey data is available on [figshare.com](https://figshare.com/articles/dataset/Athlete_Non-Athlete_MH_Survey_-_ALL_DATA_csv/13035050).

The data were collected from a mental health survey of 753 individuals. The dataset contains demographic information, general health and lifestyle information, athlete information, mental health information, and answers to mental health related questions. The study was completed in early 2020 and asked how individuals' mental health was holding up in the early stages of the COVID-19 pandemic.

### Set Up

In [1]:
import pandas as pd
import numpy as np
import altair as alt
from IPython.display import Image

# Suppress all warnings (including FutureWarning) to keep the output clean
import warnings
warnings.filterwarnings("ignore")

### Read in Data

First we check the file's character encoding, then we read in the data with a suitable encoding.

In [2]:
# Detect the character encoding from the first 10,000 bytes of the file
import chardet

with open('data/Athlete_Non-Athlete.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
result

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
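
Because `chardet.detect` was given only the first 10,000 bytes, the `'ascii'` result describes that sample rather than the whole file. A minimal sketch of how one might look for non-ASCII bytes further in, assuming that is why a pure ASCII read can fail: the helper `first_non_ascii_offset` and the temporary demo file are hypothetical and not part of the original notebook.

```python
import os
import tempfile


def first_non_ascii_offset(path, chunk_size=1 << 16):
    """Return the byte offset of the first non-ASCII byte, or None if the file is pure ASCII."""
    offset = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            for i, byte in enumerate(chunk):
                if byte > 0x7F:  # ASCII only covers 0x00-0x7F
                    return offset + i
            offset += len(chunk)
    return None


# Demo on a made-up file: 12,000 ASCII bytes followed by one Latin-1 byte,
# so a 10,000-byte sample would look like pure ASCII.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"a" * 12000 + "é".encode("latin-1") + b"\n")
    demo_path = tmp.name

demo_offset = first_non_ascii_offset(demo_path)
os.remove(demo_path)
print(demo_offset)
```

Running the same helper on `data/Athlete_Non-Athlete.csv` would show whether, and where, the survey file leaves the ASCII range.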

In [3]:
# Reading the file as 'ascii' raised an error, so we use the more permissive 'ISO-8859-1' encoding
data = pd.read_csv('data/Athlete_Non-Athlete.csv', encoding='ISO-8859-1')
data.head(5)

Unnamed: 0,Respondent ID,Gender:,Age Group:,Country During Lockdown,Mental Health Condition?,Occupation:,Marital Status:,Smoking Status:,Five Fruit and Veg,Hours sleep:,...,LONE_ TOTAL,LONE_ Emotional,LONE_ Social,I experience a general sense of emptiness,I miss having people around,There are many people I can trust completely*,I often feel rejected,There are enough people I feel close to*,There are plenty of people I can rely on when I have problems*,Unnamed: 84
0,11785667914,2,2,2,999,Unemployed,1,1,2,6.5,...,3.67,4.33,3.0,4,5,4,4,2,3,
1,11785634332,2,3,1,"3, 5, 6",Administrator,2,1,1,7.0,...,4.33,4.0,4.67,5,2,4,5,5,5,
2,11784520014,2,3,2,999,Finance,1,3,2,4.0,...,3.5,4.33,2.67,4,4,4,5,2,2,
3,11783867710,2,1,2,2,Unemployed,1,1,2,8.0,...,2.67,3.0,2.33,4,3,2,2,1,4,
4,11783726076,2,1,2,999,Student,1,1,1,8.5,...,4.33,5.0,3.67,5,5,5,5,4,2,


### Data Cleaning
Drop the last column (`Unnamed: 84`), which contains no values.

In [4]:
data = data.drop(data.columns[84], axis=1)
data.head(5)

Unnamed: 0,Respondent ID,Gender:,Age Group:,Country During Lockdown,Mental Health Condition?,Occupation:,Marital Status:,Smoking Status:,Five Fruit and Veg,Hours sleep:,...,I tend to take a long time to get over setbacks in my life*,LONE_ TOTAL,LONE_ Emotional,LONE_ Social,I experience a general sense of emptiness,I miss having people around,There are many people I can trust completely*,I often feel rejected,There are enough people I feel close to*,There are plenty of people I can rely on when I have problems*
0,11785667914,2,2,2,999,Unemployed,1,1,2,6.5,...,2,3.67,4.33,3.0,4,5,4,4,2,3
1,11785634332,2,3,1,"3, 5, 6",Administrator,2,1,1,7.0,...,1,4.33,4.0,4.67,5,2,4,5,5,5
2,11784520014,2,3,2,999,Finance,1,3,2,4.0,...,3,3.5,4.33,2.67,4,4,4,5,2,2
3,11783867710,2,1,2,2,Unemployed,1,1,2,8.0,...,4,2.67,3.0,2.33,4,3,2,2,1,4
4,11783726076,2,1,2,999,Student,1,1,1,8.5,...,5,4.33,5.0,3.67,5,5,5,5,4,2
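
The positional drop above relies on the empty column sitting at index 84. A minimal sketch of an alternative, assuming the goal is to remove every column that has no values at all, uses `DataFrame.dropna` with `axis=1` and `how="all"`. The toy frame and the helper name `drop_empty_columns` are made up for illustration and are not part of the survey data.

```python
import numpy as np
import pandas as pd


def drop_empty_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the columns whose values are all missing."""
    return df.dropna(axis=1, how="all")


# Toy stand-in for the survey data (values are invented).
toy = pd.DataFrame({
    "Respondent ID": [1, 2, 3],
    "Hours sleep": [6.5, np.nan, 8.0],          # partly missing: kept
    "Unnamed: 84": [np.nan, np.nan, np.nan],    # entirely missing: dropped
})

cleaned = drop_empty_columns(toy)
print(list(cleaned.columns))
```

This version does not depend on the column's position, but it would also drop any other column that happens to be completely empty, which may or may not be desired.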


### Data Wrangling
First, we remove the trailing colons (`:`) and question marks (`?`) from the column names.

Second, we convert the columns to appropriate data types (categorical and datetime), which makes it easier to find each column's cardinality.

Lastly, we replace cells containing the value `999` with `NaN`, because we assume that `999` means 'prefer not to answer'.

In [5]:
data.columns = data.columns.str.replace(r'[?:]$', '', regex=True)
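
The pattern `[?:]$` is anchored at the end of the string, so it removes at most one trailing `:` or `?` and leaves any such character elsewhere in a name. The sketch below shows this with Python's `re` module on a few names; `Why? Because:` is a made-up column name, and `strip_all_punct` is a hypothetical variant, not something the notebook uses.

```python
import re


def strip_trailing_punct(name: str) -> str:
    """Remove a single trailing ':' or '?' (same pattern as the cell above)."""
    return re.sub(r"[?:]$", "", name)


def strip_all_punct(name: str) -> str:
    """Hypothetical variant: remove every ':' and '?' anywhere in the name."""
    return re.sub(r"[?:]", "", name)


names = ["Gender:", "Mental Health Condition?", "Hours sleep:", "Why? Because:"]

trailing_only = [strip_trailing_punct(n) for n in names]
everywhere = [strip_all_punct(n) for n in names]

for original, a, b in zip(names, trailing_only, everywhere):
    print(f"{original!r:30} -> {a!r:28} | {b!r}")
```

For the column names visible in the preview, both variants give the same result, since the punctuation only appears at the end.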

In [6]:
categorical = [
    'Respondent ID', 'Gender', 'Age Group', 'Country During Lockdown',
    'Mental Health Condition', 'Occupation', 'Marital Status',
    'Smoking Status', 'Five Fruit and Veg', 'Shielded',
    '# in lockdown bubble', 'Athlete/Non-Athlete',
    'What sport do you play', 'Individual/Team athlete',
]

for column_name in categorical:
    data[column_name] = data[column_name].astype('category')

temporal = ['Survey Date']

for column_name in temporal:
    data[column_name] = pd.to_datetime(data[column_name])

In [7]:
data.replace(999, np.nan, inplace=True)
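
One caveat: `replace(999, ...)` matches the number `999`, not the text `'999'`. A column such as `Mental Health Condition`, which also holds entries like `"3, 5, 6"`, may have been read as text, in which case its `999` entries would be left untouched. The sketch below is an illustration under that assumption, using a made-up toy frame and a hypothetical helper `mask_sentinel`; it is not the notebook's own method.

```python
import numpy as np
import pandas as pd


def mask_sentinel(df: pd.DataFrame, sentinel: int = 999) -> pd.DataFrame:
    """Replace the sentinel with NaN whether it is stored as a number or as text."""
    return df.replace(sentinel, np.nan).replace(str(sentinel), np.nan)


# Toy stand-in for the survey data (values are invented).
toy = pd.DataFrame({
    "Mental Health Condition": ["999", "3, 5, 6", "2"],  # text column
    "Hours sleep": [6.5, 999, 8.0],                       # numeric column
})

masked = mask_sentinel(toy)
print(masked)
```

Checking `data.dtypes` before the replacement would show which columns are stored as text and therefore need the string form of the sentinel.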