# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before you start coding, provide the link to your dataset below.

My dataset: https://osf.io/9gvmw/ 

Import the necessary libraries and create your dataframe(s).

In [1]:
#IMPORTING LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
#READING FILE
df = pd.read_csv('Gender-Science_IAT.public.2020.csv')
df.head()

Unnamed: 0,session_id,session_status,study_name,date,month,day,year,hour,weekday,birthmonth,...,sius003,sius004,sius005,sius006,sius007,sius008,sius009,sius010,sius011,sius012
0,2643542376,C,Demo.GenderScience.0003,1/1/2020 0:00:16,1,1,2020,0.0,4.0,4.0,...,,,,,,,,,,
1,2643542453,,Demo.GenderScience.0003,1/1/2020 0:36:15,1,1,2020,0.0,4.0,,...,,,,,,,,,,
2,2643542545,,Demo.GenderScience.0003,1/1/2020 1:28:05,1,1,2020,1.0,4.0,,...,,,,,,,,,,
3,2643542546,,Demo.GenderScience.0003,1/1/2020 1:28:35,1,1,2020,1.0,4.0,,...,,,,,,,,,,
4,2643542547,,Demo.GenderScience.0003,1/1/2020 1:29:06,1,1,2020,1.0,4.0,,...,,,,,,,,,,


## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [3]:
#REPLACING EMPTY VALUES WITH NAN VALUES
df.replace(r'^\s*$', np.nan, regex=True, inplace = True)
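
Once blanks have been converted to NaN, it can help to measure how much of each column is missing before deciding what to do with it. The following is a minimal sketch on a made-up dataframe; the `sample` frame and its column names are assumptions for illustration and are not taken from the IAT file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataframe; column names are illustrative only.
sample = pd.DataFrame({
    "score": [0.4, np.nan, 0.1, np.nan],
    "state": ["TX", "", "   ", "CA"],
})

# Same blank-to-NaN step as in the cell above, applied to the sample.
sample = sample.replace(r"^\s*$", np.nan, regex=True)

# Fraction of missing values per column, largest first.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real `df`, the same `isna().mean()` call would show which columns are mostly empty and might be candidates for dropping.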

## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [4]:
#NA
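
No outlier handling was applied in the cell above. If a numeric column did need to be screened, one common option is the 1.5 × IQR rule. The sketch below uses invented values (the `values` series is an assumption, not data from the IAT file), and whether this rule suits a given column is a judgment call.

```python
import pandas as pd

# Made-up values standing in for a numeric column.
values = pd.Series([0.2, 0.3, 0.25, 0.35, 0.3, 4.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Flag points more than 1.5 * IQR below the first or above the third quartile.
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
print(values[is_outlier])
```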

## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address them as needed. Make sure to use code comments to illustrate your thought process.

In [5]:
#DROPPING UNNECESSARY COLUMNS

unnecessary_columns = ['session_id','study_name','date','month','day','year',
'hour','weekday','num_002','edu','Mn_RT_all_3467','N_3467','PCT_error_3467',
'pct_300','pct_400','pct_2K','pct_3K','pct_4K','raceombmulti','ran9thboys','ran9thgirls',
'D_biep.Male_Science_36','D_biep.Male_Science_47','Mn_RT_all_3','Mn_RT_all_4',
'Mn_RT_all_6','Mn_RT_all_7','Side_Science_34','Side_Male_34','SD_all_3','SD_all_4','SD_all_6','SD_all_7','N_3',
'N_4','N_5','N_6','N_7','Mn_RT_correct_3','Mn_RT_correct_4','Mn_RT_correct_6',
'Mn_RT_correct_7','Order','SD_correct_3','SD_correct_4','SD_correct_6','SD_correct_7',
'N_ERROR_3','N_ERROR_4','N_ERROR_6','religion2014','religionid','iatevaluations001',
'iatevaluations002','iatevaluations003','broughtwebsite','user_id','previous_session_id',
'previous_session_schema','deathanxiety001','deathanxiety002','deathanxiety003','deathanxiety004',
'deathanxiety005','occuSelfDetail','birthmonth','session_status','deathanxiety006','deathanxiety007','deathanxiety008','deathanxiety009','deathanxiety010',
'deathanxiety011','deathanxiety012','deathanxiety013','deathanxiety014','deathanxiety015','fearcovid001',
'fearcovid002','N_ERROR_7','fearcovid003','fearcovid004','fearcovid005','fearcovid006','fearcovid007','fearcovid008',
'pvd001','pvd002','pvd003','pvd004','pvd005','pvd006','pvd007','pvd008','pvd009','pvd010','pvd011','pvd012',
'pvd013','pvd014','pvd015','sius001','sius002','sius003','sius004','sius005','sius006','sius007','sius008','sius009','sius010','sius011','sius012']

df = df.drop(unnecessary_columns, axis = 1)
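
Besides irrelevant columns, unnecessary data can also take the form of duplicated rows or columns that hold a single value. A minimal sketch of how those two checks might look, using an invented `sample` frame (its columns are assumptions for illustration only):

```python
import pandas as pd

# Illustrative frame: one repeated row and one column with a single value.
sample = pd.DataFrame({
    "participant": [1, 2, 2, 3],
    "score": [0.1, 0.5, 0.5, 0.9],
    "study": ["A", "A", "A", "A"],
})

# Rows that exactly repeat an earlier row.
n_duplicate_rows = sample.duplicated().sum()

# Columns with at most one distinct non-missing value carry no information.
constant_columns = [c for c in sample.columns if sample[c].nunique(dropna=True) <= 1]

deduplicated = sample.drop_duplicates().drop(columns=constant_columns)
print(n_duplicate_rows, constant_columns, deduplicated.shape)
```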

## Inconsistent Data

Check for inconsistent data and address any that arises. As always, use code comments to illustrate your thought process.

In [6]:
#NA
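
No inconsistencies were addressed in the cell above. For text columns, a quick way to check is to list the distinct labels and look for variants that differ only in case or surrounding whitespace. The sketch below uses invented labels (the `states` series is an assumption, not data from the IAT file).

```python
import pandas as pd

# Invented labels with inconsistent case and stray whitespace.
states = pd.Series(["TX", "tx ", " Tx", "CA", None])

# Listing the distinct labels makes near-duplicates visible.
print(states.value_counts(dropna=False))

# Trim whitespace and unify case; missing values stay missing.
normalized = states.str.strip().str.upper()
print(normalized.value_counts(dropna=False))
```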

## Extra Cleaning


In [7]:
#CONVERTING DATATYPES
ignore_conversion = ['occuSelf','STATE','MSAName','genderIdentity']
        
for column in df.columns:
    if column not in ignore_conversion:
        df[column] = df[column].astype(float)
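
`astype(float)` raises an error if a column contains any text that cannot be parsed as a number. A more forgiving alternative is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN. This is a sketch on an invented series (`raw` is an assumption for illustration), not a change to the conversion above.

```python
import pandas as pd

# Invented column mixing numbers stored as text with a non-numeric entry.
raw = pd.Series(["1", "2.5", "unknown", None])

# errors="coerce" turns anything unparseable into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

# Values that were present but could not be parsed, worth inspecting by hand.
newly_missing = raw[converted.isna() & raw.notna()]
print(converted)
print(newly_missing)
```

Checking `newly_missing` before overwriting a column shows what coercion would silently discard.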

In [8]:
#RENAMING COLUMNS
df = df.rename(columns = {'raceomb_002': 'race',
                    'ethnicityomb': 'ethnicity',
                    'edu_14':'highest education',
                    'D_biep.Male_Science_all':'iat score',
                    'larts_7':'la association',
                    'lscience_7':'sci association',
                    'goal1':'sci importance',
                    'goal2':'math importance',
                    'politicalid_7':'politicalid'
                    })

In [9]:
#TRANSLATING CODED DATA -- It was hard to accurately represent all gender identities because there were 
#individuals who identified with multiple identities. I made the decision to categorize people with multiple 
#identities (other than the [1,3] and [2,4] combinations mapped below) as 'Gender Non-Conforming/Different Identity'; 
#however, I understand that this may not be the best representation/approach because gender is complex and often 
#difficult to reduce to a single category.

#GENDER IDENTITY
gender_identity = {'[1]':'Man',
                  '[2]':'Woman',
                  '[3]':'Transgender Man',
                  '[4]':'Transgender Woman',
                  '[5]':'Gender Non-Conforming/Different Identity',
                  '[6]':'Gender Non-Conforming/Different Identity',
                  '[1,3]':'Transgender Man',
                  '[2,4]':'Transgender Woman'}

df.replace({"genderIdentity": gender_identity}, inplace = True)

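#ANY REMAINING NON-MISSING VALUE IS A COMBINATION OF IDENTITIES. Only the genderIdentity column is updated here;
#calling df.replace(value, ...) on the whole dataframe would also change matching values in other columns.
other_identity = df['genderIdentity'].notna() & ~df['genderIdentity'].isin(list(gender_identity.values()))
df.loc[other_identity, 'genderIdentity'] = 'Gender Non-Conforming/Different Identity'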
        
#OCCUPATION
occupation = {'99-':'Unemployed/Retired',
             '9998':'Unemployed/Retired',
             '43-':'Administrative Support',
             '27-':'Arts/Design/Entertainment/Sports',
             '13-':'Business',
             '15-':'Computer/Math',
             '47-':'Construction/Extraction',
             '25-':'Education',
             '17-':'Engineers/Architects',
             '45-':'Farming, Fishing, Forestry',
             '35-':'Food Service',
             '29-':'Healthcare',
             '31-':'Healthcare',
             '00-':'Homemaker or Parenting',
             '23-':'Legal',
             '37-':'Maintenance',
             '11-':'Management',
             '55-':'Military',
             '51-':'Production',
             '33-':'Protective Services',
             '49-':'Repair/Installation',
             '41-':'Sales',
             '19-':'Science',
             '39-':'Service and Personal Care',
             '21-':'Social Service',
             '53-':'Transportation',
             '2931':'Healthcare'}

df.replace({"occuSelf": occupation}, inplace = True)

In [11]:
#SAVING THE CLEANED DATA -- to_csv returns None when given a file path, so the result is not assigned
df.to_csv('IAT DATA')

## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset?
2. Did the process of cleaning your data give you new insights into your dataset?
3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations?

1. No, I did not have all four types of dirty data.
2. When I initially looked at the dataset, I saw most of the columns as irrelevant, but while working on it I started to realize that certain columns could reveal valuable information, so I became more conservative with my cleaning process.
3. I would like to note that any visualizations made with this data are meant to show what the current data looks like; they are not necessarily something to be used to predict someone's bias. I feel that more statistical analysis and modelling would need to be done before I could make something like that.