<a href="https://colab.research.google.com/github/tsholofelo-mokheleli/ACIS-2023-New-Zealand/blob/main/Data_Pre_Processiong.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Data Preprocessing**

**Steps completed:**
* Removed unnecessary columns
* Renamed columns
* Cleaned age outliers and normalized gender values
* Label-encoded the categorical columns
* Dropped rows with null values


**Load Libraries**

In [76]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')
cmap=sns.color_palette('Blues_r')

from sklearn.preprocessing import LabelEncoder

**Load Data**

In [77]:
mh2016 = pd.read_csv('OSMI 2016 Mental Health in Tech Survey Results.csv')

# Strip leading and trailing whitespace from column names
mh2016.columns = mh2016.columns.str.strip()

mh2016.shape

(1433, 63)

**Define a list of columns to keep**

In [78]:
cols_to_keep = [
    'What is your age?',
    'What is your gender?',
    'What country do you live in?',
    'Have you had a mental health disorder in the past?',
    'Have you been diagnosed with a mental health condition by a medical professional?',
    'Do you currently have a mental health disorder?',
    'Do you have a family history of mental illness?',
    'Have you ever sought treatment for a mental health issue from a mental health professional?',
    'Are you self-employed?',
    'Do you believe your productivity is ever affected by a mental health issue?',
    'How many employees does your company or organization have?',
    'Is your employer primarily a tech company/organization?',
    'Is your primary role within your company related to tech/IT?',
    'Does your employer provide mental health benefits as part of healthcare coverage?',
    'Do you know the options for mental health care available under your employer-provided coverage?',
    'Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?',
    'Does your employer offer resources to learn more about mental health concerns and options for seeking help?',
    'Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?',
    'If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:',
    'Would you have been willing to discuss a mental health issue with your previous co-workers?',
    'Would you have been willing to discuss a mental health issue with your direct supervisor(s)?',
    'Do you feel that your employer takes mental health as seriously as physical health?',
    'Would you bring up a mental health issue with a potential employer in an interview?',
    'Have you heard of or observed negative consequences for co-workers who have been open about mental health issues in your workplace?',
    'If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?'
]

**Find intersection between columns in DataFrame and columns to keep**

In [79]:
intersection = set(cols_to_keep).intersection(mh2016.columns)

# drop columns not in intersection
mh2016 = mh2016.drop(columns=set(mh2016.columns) - intersection)

mh2016.shape

(1433, 25)
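Because the selection above is an intersection, a misspelled entry in `cols_to_keep` would be dropped silently. The shape `(1433, 25)` matches the 25 listed names, so nothing was lost here, but a small check makes that explicit. The sketch below uses a hypothetical helper, `missing_columns`, and a toy frame rather than the survey file:

```python
import pandas as pd

def missing_columns(df, wanted):
    """Return the names in `wanted` that are absent from df.columns."""
    present = set(df.columns)
    return [name for name in wanted if name not in present]

# Toy stand-in for the survey DataFrame (illustrative only)
toy = pd.DataFrame({'What is your age?': [30], 'What is your gender?': ['male']})
wanted = [
    'What is your age?',
    'What is your gender?',
    'What country do you live in?',
]

absent = missing_columns(toy, wanted)
print(absent)
```

An empty list means every requested column was found.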

**Rename Columns for Simplicity**

In [80]:
mh2016 = mh2016.rename(columns={
    'What is your age?': 'age',
    'What is your gender?': 'gender',
    'What country do you live in?': 'country',
    'Have you had a mental health disorder in the past?': 'past_mental_health',
    'Have you been diagnosed with a mental health condition by a medical professional?': 'mental_health_diagnosed',
    'Do you currently have a mental health disorder?': 'mental_health',
    'Do you have a family history of mental illness?': 'family_history',
    'Have you ever sought treatment for a mental health issue from a mental health professional?': 'treatment',
    'Are you self-employed?': 'self_employed',
    'Do you believe your productivity is ever affected by a mental health issue?': 'work_interfere',
    'How many employees does your company or organization have?': 'no_employees',
    'Is your employer primarily a tech company/organization?':'tech_company',
    'Is your primary role within your company related to tech/IT?': 'company_role',
    'Does your employer provide mental health benefits as part of healthcare coverage?': 'benefits',
    'Do you know the options for mental health care available under your employer-provided coverage?': 'care_options',
    'Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?': 'wellness_program',
    'Does your employer offer resources to learn more about mental health concerns and options for seeking help?': 'seek_help',
    'Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?': 'anonymity',
    'If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:': 'leave',
    'Would you have been willing to discuss a mental health issue with your previous co-workers?': 'coworkers',
    'Would you have been willing to discuss a mental health issue with your direct supervisor(s)?': 'supervisor',
    'If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?': 'discuss_mh',
    'Do you feel that your employer takes mental health as seriously as physical health?': 'mental_importance',
    'Would you bring up a mental health issue with a potential employer in an interview?': 'mental_health_interview',
    'Have you heard of or observed negative consequences for co-workers who have been open about mental health issues in your workplace?': 'neg_consequence_coworker'
})

**Show the Updated DataFrame**

In [81]:
mh2016.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1433 entries, 0 to 1432
Data columns (total 25 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   self_employed             1433 non-null   int64  
 1   no_employees              1146 non-null   object 
 2   tech_company              1146 non-null   float64
 3   company_role              263 non-null    float64
 4   benefits                  1146 non-null   object 
 5   care_options              1013 non-null   object 
 6   wellness_program          1146 non-null   object 
 7   seek_help                 1146 non-null   object 
 8   anonymity                 1146 non-null   object 
 9   leave                     1146 non-null   object 
 10  mental_importance         1146 non-null   object 
 11  neg_consequence_coworker  1146 non-null   object 
 12  discuss_mh                287 non-null    object 
 13  work_interfere            287 non-null    object 
 14  coworker

In [82]:
mh2016.shape

(1433, 25)
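The `info()` output shows that several columns have far fewer than 1433 non-null values. A per-column summary of missing counts can make this easier to scan. This is a minimal sketch on a toy frame; the column values are illustrative and `summary` is a name introduced here:

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (values are illustrative, not survey data)
toy = pd.DataFrame({
    'self_employed': [0, 1, 0, 0],
    'benefits': ['Yes', np.nan, 'No', 'Yes'],
    'company_role': [np.nan, np.nan, 1.0, np.nan],
})

# Count missing values per column and express them as a fraction of all rows
missing = toy.isna().sum()
summary = pd.DataFrame({
    'missing': missing,
    'fraction': missing / len(toy),
}).sort_values('missing', ascending=False)

print(summary)
```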

### **Data Cleaning**

In [83]:
# Work on a copy so the cleaning steps do not modify mh2016
mh = mh2016.copy()

**Age**

In [84]:
# Treat ages below 15 or above 80 as outliers and set them to NaN
mh['age'] = mh['age'].where(mh['age'].between(15, 80))
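The age rule keeps values from 15 to 80 inclusive and turns everything else into `NaN`. A small sketch on made-up ages shows the effect; the numbers are illustrative, not taken from the survey:

```python
import numpy as np
import pandas as pd

# Made-up ages, including three implausible values
ages = pd.DataFrame({'age': [39, 3, 29, 323, 99, 43]})

# between() is inclusive on both ends by default, so 15 and 80 are kept
valid = ages['age'].between(15, 80)

# where() keeps values where the mask is True and writes NaN elsewhere
ages['age'] = ages['age'].where(valid)

print(ages)
```

The column becomes `float64` because integer columns cannot hold `NaN`, which is why the notebook converts `age` back to `int64` after dropping nulls.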

**Gender**

In [85]:
mh['gender'] = mh['gender'].str.strip().str.lower()
mh['gender'].unique()

array(['male', 'female', 'm', 'i identify as female.', 'bigender',
       'non-binary', 'female assigned at birth', 'f', 'woman', 'man',
       'fm', 'cis female', 'transitioned, m2f',
       'genderfluid (born female)', 'other/transfeminine',
       'female or multi-gender femme', 'female/woman', 'cis male',
       'male.', 'androgynous', 'male 9:1 female, roughly', nan,
       'male (cis)', 'other', 'nb masculine', 'cisgender female',
       'sex is male', 'none of your business', 'genderqueer', 'human',
       'genderfluid', 'enby', 'malr', 'genderqueer woman', 'mtf', 'queer',
       'agender', 'dude', 'fluid',
       "i'm a man why didn't you make this a drop down question. you should of asked sex? and i would of answered yes please. seriously how much text can this take?",
       'mail', 'm|', 'male/genderqueer', 'fem', 'nonbinary',
       'female (props for making this a freeform field, though)',
       'unicorn', 'male (trans, ftm)', 'cis-woman', 'cisdude',
       'genderflux de

In [86]:
# Assign the result back instead of using inplace=True on a column selection,
# which may not update mh in recent pandas versions
mh['gender'] = mh['gender'].replace(['male','m', 'cis male',
                      'man',  'mail', 'male-ish', 'male (cis)',
                      'cis man', 'msle', 'malr', 'mal', 'maile', 'make','cisdude',
                      'man','male.','sex is male','dude','mail', 'm|',
                      "i'm a man why didn't you make this a drop down question. you should of asked sex? and i would of answered yes please. seriously how much text can this take?",
                      'cis-het male','masculino','cis hetero male','cis-male',"male (hey this is the tech industry you're talking about)", 'male, cis',
                      'ostensibly male', 'male, born with xy chromosoms', 'malel','cisgender male','let\'s keep it simple and say "male"', 'identify as male',
                      'masculine', 'cishet male', 'i have a penis','mostly male', 'male/he/him',
                      ], 'Male')

mh['gender'] = mh['gender'].replace(['female','f','woman','femail', 'cis female', 'cis-female/femme', 'femake', 'female (cis)',
                      'i identify as female.', 'female ','female assigned at birth ','fm',
                      'female/woman','cisgender female','fem','cis-woman','femalw','my sex is female.',
                      'female (cisgender)','f, cisgender', 'female-ish','i identify as female','cis-female',
                      'cis woman','cisgendered woman','gender non-conforming woman','female-identified','femmina',
                      '*shrug emoji* (f)','femile', 'female, she/her','female assigned at birth',
                      ], 'Female')

mh['gender'] = mh['gender'].replace(['female (trans)', 'queer/she/they', 'non-binary','transgender',
                     'fluid', 'queer', 'androgyne', 'trans-female', 'male leaning androgynous',
                      'agender', 'a little about you', 'nah', 'all',
                      'ostensibly male, unsure what that really means',
                      'genderqueer', 'enby', 'p', 'neuter', 'something kinda male?',
                      'guy (-ish) ^_^', 'trans woman','non-binary/agender','transitioned, m2f','genderfluid (born female)',
                      'other/transfeminine','androgynous','female or multi-gender femme','male 9:1 female, roughly',
                      'none of your business', 'genderqueer','human', 'genderfluid',
                      'genderqueer woman', 'mtf', 'fluid','male/genderqueer','nonbinary','other','nb masculine',
                      'female (props for making this a freeform field, though)','unicorn','male (trans, ftm)',
                      'genderflux demi-girl', 'female-bodied; no feelings about gender','woman-identified',
                      'uhhhhhhhhh fem genderqueer?','male/androgynous','afab','transgender woman','god king of the valajar',
                      'agender/genderfluid', 'sometimes','transfeminine','none','male (or female, or both)',
                      'trans man','bigender','contextual', 'non binary', 'genderqueer demigirl', 'genderqueer/non-binary',
                      'demiguy', 'trans female','she/her/they/them', 'swm', 'nb','nonbinary/femme', 'questioning','rr','agender trans woman',
                      'i am a wookie','trans non-binary/genderfluid', 'non-binary and gender fluid','afab non-binary', 'b','homem cis','female/gender non-binary.',
                       '43','\\-',
                      ], 'Other')

In [87]:
mh['gender'].value_counts()

Male      1057
Female     337
Other       36
Name: gender, dtype: int64
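The three long replacement lists can also be written as a lookup function. The sketch below is one possible alternative, not the notebook's method: `male_terms`, `female_terms`, and `normalize_gender` are hypothetical names, and the sets hold only a small illustrative subset of the spellings. It also differs in one design choice: any spelling not listed falls into `'Other'` by default instead of being left unchanged.

```python
import numpy as np
import pandas as pd

# Small illustrative subsets of the spellings handled above (not the full lists)
male_terms = {'male', 'm', 'man', 'cis male', 'malr'}
female_terms = {'female', 'f', 'woman', 'cis female', 'fem'}

def normalize_gender(value):
    """Map a free-text answer to 'Male', 'Female', or 'Other'; keep NaN as NaN."""
    if pd.isna(value):
        return np.nan
    cleaned = str(value).strip().lower()
    if cleaned in male_terms:
        return 'Male'
    if cleaned in female_terms:
        return 'Female'
    return 'Other'

raw = pd.Series([' Male', 'F', 'malr', 'genderqueer', np.nan, 'Cis Female '])
normalized = raw.map(normalize_gender)

print(normalized.value_counts(dropna=False))
```

Defaulting to `'Other'` means new misspellings of "male" or "female" would be misclassified until they are added to the sets, so the unmatched values are worth reviewing before relying on it.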

In [88]:
mh.head()

Unnamed: 0,self_employed,no_employees,tech_company,company_role,benefits,care_options,wellness_program,seek_help,anonymity,leave,...,supervisor,mental_health_interview,family_history,past_mental_health,mental_health,mental_health_diagnosed,treatment,age,gender,country
0,0,26-100,1.0,,Not eligible for coverage / N/A,,No,No,I don't know,Very easy,...,Some of my previous employers,Maybe,No,Yes,No,Yes,0,39.0,Male,United Kingdom
1,0,6-25,1.0,,No,Yes,Yes,Yes,Yes,Somewhat easy,...,Some of my previous employers,No,Yes,Yes,Yes,Yes,1,29.0,Male,United States of America
2,0,6-25,1.0,,No,,No,No,I don't know,Neither easy nor difficult,...,I don't know,Yes,No,Maybe,No,No,1,38.0,Male,United Kingdom
3,1,,,,,,,,,,...,Some of my previous employers,Maybe,No,Yes,Yes,Yes,1,43.0,Male,United Kingdom
4,0,6-25,0.0,1.0,Yes,Yes,No,No,No,Neither easy nor difficult,...,Some of my previous employers,No,Yes,Yes,Yes,Yes,1,43.0,Female,United States of America


**Label Encoding**

In [89]:
# Initialize the LabelEncoder
label_encoder = LabelEncoder()

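# Label-encode every column except 'age'.
# Note: missing values are still present at this point, so each column's NaN
# ends up encoded as its own integer class rather than staying null.
for column in mh.columns:
    if column != 'age':
        # Lowercase string values so differently cased answers share one label
        mh[column] = mh[column].apply(lambda x: x.lower() if isinstance(x, str) else x)
        # fit_transform refits the shared encoder, so earlier mappings are overwritten
        mh[column] = label_encoder.fit_transform(mh[column])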
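Reusing one `LabelEncoder` for every column means only the last column's mapping survives, so the integer codes cannot be decoded later. A possible variation, sketched here on toy data, keeps one encoder per column and turns missing answers into an explicit `'missing'` category before encoding. The `encoders` dictionary and the `'missing'` label are assumptions for illustration, and the resulting codes will not match the ones in the notebook output.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for a few survey columns (values are illustrative)
toy = pd.DataFrame({
    'benefits': ['Yes', 'No', np.nan, "I don't know"],
    'leave': ['Very easy', np.nan, 'Somewhat easy', 'Very easy'],
    'age': [39.0, 29.0, 38.0, 43.0],
})

encoders = {}  # hypothetical: one fitted encoder per column
for column in toy.columns.drop('age'):
    # Make missing answers an explicit category, then lowercase
    values = toy[column].fillna('missing').astype(str).str.lower()
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(values)

# Recover the code -> label mapping for any encoded column
benefits_mapping = {
    code: str(label) for code, label in enumerate(encoders['benefits'].classes_)
}
print(benefits_mapping)
```

`LabelEncoder` assigns codes in sorted order of the labels, so the integers carry no ordinal meaning for answers such as company size or ease of leave.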

In [90]:
mh.head()

Unnamed: 0,self_employed,no_employees,tech_company,company_role,benefits,care_options,wellness_program,seek_help,anonymity,leave,...,supervisor,mental_health_interview,family_history,past_mental_health,mental_health,mental_health_diagnosed,treatment,age,gender,country
0,0,2,1,2,2,3,1,1,0,5,...,2,0,1,2,1,1,0,39.0,1,49
1,0,4,1,2,1,2,2,2,2,3,...,2,1,2,2,2,1,1,29.0,1,50
2,0,4,1,2,1,3,1,1,0,1,...,0,2,1,0,1,0,1,38.0,1,49
3,1,6,2,2,4,3,3,3,3,6,...,2,0,1,2,2,1,1,43.0,1,49
4,0,4,0,1,3,2,1,1,1,1,...,2,1,2,2,2,1,1,43.0,0,50


In [91]:
mh.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1433 entries, 0 to 1432
Data columns (total 25 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   self_employed             1433 non-null   int64  
 1   no_employees              1433 non-null   int64  
 2   tech_company              1433 non-null   int64  
 3   company_role              1433 non-null   int64  
 4   benefits                  1433 non-null   int64  
 5   care_options              1433 non-null   int64  
 6   wellness_program          1433 non-null   int64  
 7   seek_help                 1433 non-null   int64  
 8   anonymity                 1433 non-null   int64  
 9   leave                     1433 non-null   int64  
 10  mental_importance         1433 non-null   int64  
 11  neg_consequence_coworker  1433 non-null   int64  
 12  discuss_mh                1433 non-null   int64  
 13  work_interfere            1433 non-null   int64  
 14  coworker

**Drop null values**

In [92]:
# Drop rows that still contain nulls; copy so the assignment below acts on a new frame
mh = mh.dropna().copy()

# Convert 'age' column to int64 data type
mh['age'] = mh['age'].astype('int64')

mh.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1430 entries, 0 to 1432
Data columns (total 25 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   self_employed             1430 non-null   int64
 1   no_employees              1430 non-null   int64
 2   tech_company              1430 non-null   int64
 3   company_role              1430 non-null   int64
 4   benefits                  1430 non-null   int64
 5   care_options              1430 non-null   int64
 6   wellness_program          1430 non-null   int64
 7   seek_help                 1430 non-null   int64
 8   anonymity                 1430 non-null   int64
 9   leave                     1430 non-null   int64
 10  mental_importance         1430 non-null   int64
 11  neg_consequence_coworker  1430 non-null   int64
 12  discuss_mh                1430 non-null   int64
 13  work_interfere            1430 non-null   int64
 14  coworkers                 1430 non-null 
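Judging from the `info()` output, every encoded column is already fully non-null after label encoding, so the three dropped rows appear to be the ones whose `age` was set to `NaN` as an outlier. Under that assumption, the drop can be made explicit with `subset`. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy encoded frame: only 'age' can still hold NaN (illustrative values)
encoded = pd.DataFrame({
    'gender': [1, 0, 1, 2],
    'age': [39.0, np.nan, 29.0, 43.0],
})

# Drop rows with a missing age and copy to avoid assigning into a view
cleaned = encoded.dropna(subset=['age']).copy()
cleaned['age'] = cleaned['age'].astype('int64')

print(cleaned)
```

The original row labels are kept, which is why the notebook output shows 1430 entries with an index running from 0 to 1432. Calling `reset_index(drop=True)` would renumber them if that matters downstream.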

**Create and Download a new dataset**

In [93]:
from google.colab import files

In [94]:
# Save the cleaned dataframe as a CSV file
mh.to_csv('Mental Health.csv', index=False)

# Download the CSV file
files.download('Mental Health.csv')
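The `google.colab` module exists only inside Colab, so the import above fails when the notebook runs elsewhere. One way to keep the save step portable is to guard the download. The helper name `save_dataset` is hypothetical and this is only a sketch:

```python
import os
import tempfile

import pandas as pd

def save_dataset(df, path):
    """Write df to CSV; trigger a browser download only when running in Colab."""
    df.to_csv(path, index=False)
    try:
        from google.colab import files  # available only inside Colab
    except ImportError:
        return False  # saved locally, no download triggered
    files.download(path)
    return True

# Toy frame and a temporary location, used here only to exercise the helper
toy = pd.DataFrame({'age': [39, 29], 'gender': [1, 0]})
out_path = os.path.join(tempfile.mkdtemp(), 'Mental Health.csv')
downloaded = save_dataset(toy, out_path)

# Read the file back to confirm the round trip
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```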
