# ___Project - Milestone 2___

###  <span style="color: gray;">Jade Chen, Sam Thorne, Dia Zavery</span> 

#### Dataset Background Information

The athlete/non-athlete survey data is available on [figshare.com](https://figshare.com/articles/dataset/Athlete_Non-Athlete_MH_Survey_-_ALL_DATA_csv/13035050).

The data were collected from a mental health survey of 753 individuals. The dataset contains demographic information, general health and lifestyle information, athlete information, mental health information, and answers to mental-health-related questions. The study was completed in early 2020 and asked how individuals' mental health was coping in the early stages of the COVID-19 pandemic.

### Set Up

In [1]:
import pandas as pd
import numpy as np
import altair as alt
from IPython.display import Image

# Suppress FutureWarning messages only (other warnings remain visible)
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

### Read in Data

First we check the file's character encoding, then we read the data with a suitable encoding.

In [2]:
# Detect the character encoding from the first 10,000 bytes of the file
import chardet

with open('data/Athlete_Non-Athlete.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
result

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

In [3]:
# Reading with the detected 'ascii' encoding raised an error, so we use the more permissive 'ISO-8859-1'
data = pd.read_csv('data/Athlete_Non-Athlete.csv', encoding='ISO-8859-1')
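One possible explanation for the mismatch is that `chardet` only saw the first 10,000 bytes, so a non-ASCII byte further into the file would not affect the detection result. The sketch below shows one way to try several encodings in order; `read_csv_with_fallback` is a hypothetical helper, not part of pandas, and the encoding list is an assumption. Note that `ISO-8859-1` maps every byte to a character, so it never fails to decode, although it can silently mis-decode text written in another encoding.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "ISO-8859-1")):
    """Hypothetical helper: return (DataFrame, encoding) for the first encoding that decodes."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError as error:
            # Remember the failure and move on to the next candidate
            last_error = error
    raise last_error


# Possible usage with the notebook's file (path taken from the cell above):
# data, used_encoding = read_csv_with_fallback('data/Athlete_Non-Athlete.csv')
```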

### Data Cleaning
Drop the last column, which contains no values.

In [4]:
data = data.drop(data.columns[84], axis=1)
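Dropping by a hard-coded position works only while the file layout stays the same. As a minimal sketch on a toy frame (the column names and values are made up for illustration), `dropna(axis=1, how="all")` removes every column that is entirely empty, wherever it sits:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (hypothetical values)
toy = pd.DataFrame({
    "a": [1, 2],
    "b": [3.0, np.nan],          # partly missing: kept
    "empty": [np.nan, np.nan],   # entirely missing: dropped
})

# Drop columns in which every value is missing
toy_clean = toy.dropna(axis=1, how="all")
print(list(toy_clean.columns))
```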

### Data Wrangling
First, we remove a trailing colon (`:`) or question mark (`?`) from the column names.

In [5]:
data.columns = data.columns.str.replace(r'[?:]$', '', regex=True)
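Because the pattern is anchored with `$`, it strips only a single `:` or `?` at the very end of a name. The sketch below illustrates this on a few sample names (the last one is invented to show the difference) and contrasts it with an unanchored pattern that removes these characters everywhere:

```python
import pandas as pd

# Sample column names; "Why: this?" is a made-up name for illustration
cols = pd.Index(["Occupation:", "Five Fruit and Veg?", "Athlete/Non-Athlete", "Why: this?"])

# Anchored pattern: removes one ':' or '?' at the end of the name only
trailing_only = cols.str.replace(r"[?:]$", "", regex=True)

# Unanchored pattern: removes every ':' and '?' anywhere in the name
everywhere = cols.str.replace(r"[?:]", "", regex=True)

print(list(trailing_only))
print(list(everywhere))
```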

Convert the columns to appropriate data types so that cardinality is easier to compute.

In [6]:
categorical = ['Gender', 'Age Group', 'Country During Lockdown', 'Mental Health Condition', 'Occupation', 'Marital Status', 'Smoking Status', 'Five Fruit and Veg', 'Athlete/Non-Athlete']

for column_name in categorical:
    data[column_name] = data[column_name].astype('category')

temporal = ['Survey Date']

for column_name in temporal:
    data[column_name] = pd.to_datetime(data[column_name])

We replace cells holding the value `999` with `NaN`, because we assume that `999` means 'prefer not to answer'.

In [7]:
data.replace(999, np.nan, inplace=True)
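One caveat worth checking: `replace(999, np.nan)` matches the number `999`, not the text `'999'`. The preview output in this notebook still shows `999` in `Mental Health Condition`, which would be consistent with that column holding strings (it also contains values such as `3, 5, 6`). The sketch below reproduces the effect on a toy frame with hypothetical values and shows one way to handle the string case explicitly:

```python
import numpy as np
import pandas as pd

# Toy frame: one numeric column and one text column (hypothetical values)
toy = pd.DataFrame({
    "score": [3, 999, 5],
    "condition": ["2", "999", "3, 5, 6"],
})

# Replaces the integer 999 but leaves the string "999" untouched
replaced = toy.replace(999, np.nan)

# Handle the text column separately by matching the string form
fixed = replaced.copy()
fixed["condition"] = fixed["condition"].replace("999", np.nan)

print(replaced["condition"].isna().sum(), fixed["condition"].isna().sum())
```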

Replace 1 with 'Male' and 2 with 'Female' in the `Gender` column.

In [8]:
data['Gender'] = data['Gender'].replace({1: 'Male', 2: 'Female'})

Replace 1 with 'Athlete' and 2 with 'Non-Athlete' in the `Athlete/Non-Athlete` column.

In [9]:
data['Athlete/Non-Athlete'] = data['Athlete/Non-Athlete'].replace({1: 'Athlete', 2: 'Non-Athlete'})
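Since `Gender` and `Athlete/Non-Athlete` were already converted to the `category` dtype, relabelling the categories directly is an alternative to `replace`; recent pandas releases emit a deprecation warning for `replace` on categorical data and recommend `rename_categories`. A minimal sketch on a toy series (the values are hypothetical):

```python
import pandas as pd

# Toy categorical column coded 1/2, as in the survey's Gender column
gender = pd.Series([1, 2, 2, 1]).astype("category")

# Relabel the categories in place of replacing individual values
gender = gender.cat.rename_categories({1: "Male", 2: "Female"})

print(gender.tolist())
```

The column keeps its `category` dtype, so later cardinality checks behave the same way.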

In [10]:
data.head(5)

| | Respondent ID | Gender | Age Group | Country During Lockdown | Mental Health Condition | Occupation | Marital Status | Smoking Status | Five Fruit and Veg | Hours sleep | ... | I tend to take a long time to get over setbacks in my life* | LONE_ TOTAL | LONE_ Emotional | LONE_ Social | I experience a general sense of emptiness | I miss having people around | There are many people I can trust completely* | I often feel rejected | There are enough people I feel close to* | There are plenty of people I can rely on when I have problems* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11785667914 | Female | 2 | 2 | 999 | Unemployed | 1 | 1 | 2 | 6.5 | ... | 2.0 | 3.67 | 4.33 | 3.0 | 4.0 | 5.0 | 4.0 | 4.0 | 2.0 | 3.0 |
| 1 | 11785634332 | Female | 3 | 1 | 3, 5, 6 | Administrator | 2 | 1 | 1 | 7.0 | ... | 1.0 | 4.33 | 4.0 | 4.67 | 5.0 | 2.0 | 4.0 | 5.0 | 5.0 | 5.0 |
| 2 | 11784520014 | Female | 3 | 2 | 999 | Finance | 1 | 3 | 2 | 4.0 | ... | 3.0 | 3.5 | 4.33 | 2.67 | 4.0 | 4.0 | 4.0 | 5.0 | 2.0 | 2.0 |
| 3 | 11783867710 | Female | 1 | 2 | 2 | Unemployed | 1 | 1 | 2 | 8.0 | ... | 4.0 | 2.67 | 3.0 | 2.33 | 4.0 | 3.0 | 2.0 | 2.0 | 1.0 | 4.0 |
| 4 | 11783726076 | Female | 1 | 2 | 999 | Student | 1 | 1 | 1 | 8.5 | ... | 5.0 | 4.33 | 5.0 | 3.67 | 5.0 | 5.0 | 5.0 | 5.0 | 4.0 | 2.0 |


## Data Attribute Information: Type, Cardinality, and Missing Values
The following DataFrame lists, for each attribute, its name, data type, cardinality (number of unique values), range (for quantitative features), and number of missing values.

In [11]:
pd.set_option('display.max_rows', None)
# Function to determine if a column is quantitative
def is_quantitative(column):
    return pd.api.types.is_numeric_dtype(column)

quant_range = []

# Loop through the columns of the original DataFrame
for column_name in data.columns:
    column_data = data[column_name]
    data_type = column_data.dtype
    # Unique values are counted for every column; a range is reported only for numeric ones
    unique_values = column_data.nunique()
    if is_quantitative(column_data):
        data_range = f'{column_data.min()} - {column_data.max()}'
    else:
        data_range = 'N/A'
        
    count_na = data[column_name].isna().sum()
    quant_range.append({'Column': column_name, 'Data Type': data_type, 'Unique Values': unique_values, 'Range': data_range, 'Missing Values': count_na})

# Convert the list of dictionaries to a DataFrame
quant_range_df = pd.DataFrame(quant_range)

# Display the DataFrame with data types and ranges
quant_range_df

| | Column | Data Type | Unique Values | Range | Missing Values |
|---|---|---|---|---|---|
| 0 | Respondent ID | int64 | 753 | 11722163175 - 11785667914 | 0 |
| 1 | Gender | category | 2 | N/A | 0 |
| 2 | Age Group | category | 7 | N/A | 0 |
| 3 | Country During Lockdown | category | 7 | N/A | 0 |
| 4 | Mental Health Condition | category | 12 | N/A | 0 |
| 5 | Occupation | category | 405 | N/A | 3 |
| 6 | Marital Status | category | 5 | N/A | 0 |
| 7 | Smoking Status | category | 7 | N/A | 0 |
| 8 | Five Fruit and Veg | category | 2 | N/A | 0 |
| 9 | Hours sleep | float64 | 18 | 1.5 - 10.5 | 0 |


### Changing column names

In [12]:
data3 = data.rename(columns = {'Athlete/Non-Athlete' : 'is_athlete'})


### Changing categorical values in the data

In [22]:
# removing spaces from column headers

data4 = data3.copy()

data4.columns = data3.columns.str.replace(' ', '')

data4['AgeGroup']=data4['AgeGroup'].replace(
    {
        1: '18-20', 2: '21-30', 3:'31-40', 4:'41-50', 5:'51-60', 6:'61-70', 7:'71+'
    }
)

# CountryDuringLockdown
data4['CountryDuringLockdown']=data4['CountryDuringLockdown'].replace(
    {
        1:'UK', 2:'Ireland', 3:'New Zealand', 
        4:'Australia', 5:'Thailand', 6:'Belgium', 7:'Sweden'
    }
)

# MaritalStatus
data4['MaritalStatus'] = data4['MaritalStatus'].replace(
    {
        1:'Single',
        2:'Married/Cohabiting',
        3:'Civil Partnership',
        4:'Divorced',
        5:'Widowed'
    }
)

#SmokingStatus
data4['SmokingStatus'] = data4['SmokingStatus'].replace(
    {
        1:'Never',
        2:'Ex-occasional smoker',
        3:'Ex-smoker',
        4:'Occasional',
        5:'Half pack daily',
        6:'Full pack daily',
        7:'Multiple packs daily'
    }
)

# FiveFruitandVeg
data4['FiveFruitandVeg'] = data4['FiveFruitandVeg'].replace(
    {
        1:'Yes',
        2:'No'
    }
)

# Drop rows where PsychologicalWellbeing equals 5994
data4 = data4[data4['PsychologicalWellbeing'] != 5994]

# Bin hours of sleep into two-hour intervals
data4['Hourssleep'] = data4['Hourssleep'].replace(
        dict.fromkeys([1, 1.5, 2, 2.5, 3, 3.5],'< 4')).replace(
        dict.fromkeys([4, 4.5, 5, 5.5],'< 6')).replace(
        dict.fromkeys([6, 6.5, 7, 7.5],'< 8')).replace(
        dict.fromkeys([8, 8.5, 9, 9.5],'< 10')).replace(
        dict.fromkeys([10, 10.5],'10 +'))

# Recode the WeeksSocialDistancing interval codes
data4['WeeksSocialDistancing']=data4['WeeksSocialDistancing'].replace({0:1, 2:4, 3:7, 4:10, 5:13, 6:16, 7:19, 8:21})

data4.head()

| | RespondentID | Gender | AgeGroup | CountryDuringLockdown | MentalHealthCondition | Occupation | MaritalStatus | SmokingStatus | FiveFruitandVeg | Hourssleep | ... | Itendtotakealongtimetogetoversetbacksinmylife* | LONE_TOTAL | LONE_Emotional | LONE_Social | Iexperienceageneralsenseofemptiness | Imisshavingpeoplearound | TherearemanypeopleIcantrustcompletely* | Ioftenfeelrejected | ThereareenoughpeopleIfeelcloseto* | ThereareplentyofpeopleIcanrelyonwhenIhaveproblems* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11785667914 | Female | 21-30 | Ireland | 999 | Unemployed | Single | Never | No | < 8 | ... | 2.0 | 3.67 | 4.33 | 3.0 | 4.0 | 5.0 | 4.0 | 4.0 | 2.0 | 3.0 |
| 1 | 11785634332 | Female | 31-40 | UK | 3, 5, 6 | Administrator | Married/Cohabiting | Never | Yes | < 8 | ... | 1.0 | 4.33 | 4.0 | 4.67 | 5.0 | 2.0 | 4.0 | 5.0 | 5.0 | 5.0 |
| 2 | 11784520014 | Female | 31-40 | Ireland | 999 | Finance | Single | Ex-smoker | No | < 6 | ... | 3.0 | 3.5 | 4.33 | 2.67 | 4.0 | 4.0 | 4.0 | 5.0 | 2.0 | 2.0 |
| 3 | 11783867710 | Female | 18-20 | Ireland | 2 | Unemployed | Single | Never | No | < 10 | ... | 4.0 | 2.67 | 3.0 | 2.33 | 4.0 | 3.0 | 2.0 | 2.0 | 1.0 | 4.0 |
| 4 | 11783726076 | Female | 18-20 | Ireland | 999 | Student | Single | Never | Yes | < 10 | ... | 5.0 | 4.33 | 5.0 | 3.67 | 5.0 | 5.0 | 5.0 | 5.0 | 4.0 | 2.0 |
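The chained `replace` calls that bin `Hourssleep` list every half-hour value by hand. As an alternative sketch, `pd.cut` can produce the same labels from bin edges; with `right=False` each interval includes its lower edge and excludes its upper edge, matching the mapping above (for example, 4 falls in `< 6` and 10 falls in `10 +`). The sample values are hypothetical.

```python
import pandas as pd

# Sample sleep durations in hours (hypothetical values)
hours = pd.Series([1.5, 4.0, 6.5, 7.5, 8.0, 10.5])

# Left-closed intervals: [0, 4), [4, 6), [6, 8), [8, 10), [10, inf)
bins = [0, 4, 6, 8, 10, float("inf")]
labels = ["< 4", "< 6", "< 8", "< 10", "10 +"]

binned = pd.cut(hours, bins=bins, labels=labels, right=False)
print(binned.tolist())
```

The result is an ordered categorical, so the bins keep their natural order when sorted or plotted.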


### Save to CSV

In [24]:
data.to_csv('data/my_data(v2).csv', index=False) #CHANGE CSV NAME AS NEEDED (VERSION)

# Another version with Athlete/Non-Athlete column title changed to is_athlete
data3.to_csv('data/my_data(v3).csv', index = False)

# Another version with all the categorical variables as strings
data4.to_csv('data/Sam_viz_data.csv', index = False)
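CSV files do not store pandas dtypes, so the `category` and datetime conversions made earlier are lost when a saved file is read back. The sketch below round-trips a toy frame through an in-memory buffer and restores the types on read; the values are made up, and only the `Gender` and `Survey Date` column names come from this notebook.

```python
import io

import pandas as pd

# Toy frame with a categorical and a datetime column (hypothetical values)
toy = pd.DataFrame({
    "Gender": pd.Categorical(["Male", "Female"]),
    "Survey Date": pd.to_datetime(["2020-04-01", "2020-04-02"]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Re-apply the dtypes while reading the CSV back
restored = pd.read_csv(buffer, dtype={"Gender": "category"}, parse_dates=["Survey Date"])
print(restored.dtypes)
```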