# Checkpoint Two: Exploratory Data Analysis

Now that your chosen dataset is approved, it is time to start working on your analysis. Use this notebook to perform your EDA and make notes where directed to as you work.

## Getting Started

Since we have not provided your dataset for you, you will need to load the necessary files in this repository. Make sure to include a link back to the original dataset here as well.

My dataset:

Your first task in EDA is to import necessary libraries and create a dataframe(s). Make note in the form of code comments of what your thought process is as you work on this setup task.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


## Get to Know the Numbers

Now that you have everything set up, put any code that you use to get to know the dataframe and its rows and columns better in the cell below. You can use whatever techniques you like, except for visualizations. You will put those in a separate section.

When working on your code, make sure to leave comments so that your mentors can understand your thought process.

In [2]:
# Load the 2020 Gender-Science IAT data and preview the first five rows.
df = pd.read_csv('Gender-Science_IAT.public.2020.csv')
df.head()

|  | session_id | session_status | study_name | date | month | day | year | hour | weekday | birthmonth | ... | sius003 | sius004 | sius005 | sius006 | sius007 | sius008 | sius009 | sius010 | sius011 | sius012 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2643542376 | C | Demo.GenderScience.0003 | 1/1/2020 0:00:16 | 1 | 1 | 2020 | 0.0 | 4.0 | 4.0 | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | 2643542453 |  | Demo.GenderScience.0003 | 1/1/2020 0:36:15 | 1 | 1 | 2020 | 0.0 | 4.0 |  | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | 2643542545 |  | Demo.GenderScience.0003 | 1/1/2020 1:28:05 | 1 | 1 | 2020 | 1.0 | 4.0 |  | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | 2643542546 |  | Demo.GenderScience.0003 | 1/1/2020 1:28:35 | 1 | 1 | 2020 | 1.0 | 4.0 |  | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | 2643542547 |  | Demo.GenderScience.0003 | 1/1/2020 1:29:06 | 1 | 1 | 2020 | 1.0 | 4.0 |  | ... |  |  |  |  |  |  |  |  |  |  |


In [3]:
# Replace blank or whitespace-only strings with NaN so they count as missing.
df.replace(r'^\s*$', np.nan, regex=True, inplace=True)

# Print the percentage of missing values in each column.
for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing*100)))

session_id - 0%
session_status - 52%
study_name - 0%
date - 0%
month - 0%
day - 0%
year - 0%
hour - 0%
weekday - 0%
birthmonth - 38%
birthyear - 41%
num_002 - 38%
birthSex - 37%
raceomb_002 - 44%
raceombmulti - 97%
ethnicityomb - 43%
edu - 41%
edu_14 - 41%
D_biep.Male_Science_all - 37%
Mn_RT_all_3467 - 37%
N_3467 - 37%
PCT_error_3467 - 37%
Order - 37%
Side_Science_34 - 37%
Side_Male_34 - 37%
pct_300 - 37%
pct_400 - 37%
pct_2K - 37%
pct_3K - 37%
pct_4K - 37%
arts - 38%
science - 38%
larts_7 - 38%
lscience_7 - 38%
factorability - 42%
factordiscrimination - 42%
factorencouragement - 42%
factorfamily - 42%
factorhighpower - 42%
factorinterest - 42%
genderIdentity - 37%
goal1 - 40%
goal2 - 40%
ran9thboys - 41%
ran9thgirls - 41%
D_biep.Male_Science_36 - 37%
D_biep.Male_Science_47 - 37%
Mn_RT_all_3 - 37%
Mn_RT_all_4 - 37%
Mn_RT_all_6 - 37%
Mn_RT_all_7 - 37%
SD_all_3 - 37%
SD_all_4 - 37%
SD_all_6 - 37%
SD_all_7 - 37%
N_3 - 37%
N_4 - 37%
N_5 - 37%
N_6 - 37%
N_7 - 37%
Mn_RT_correct_3 - 37%
Mn_RT

When a session was completed, it was marked with a "C" for completion. When it was not completed, the field was left empty. Because we calculated the percentage of missing (or empty) values for each column, we know that 52% of sessions were not completed.

The "D_biep.Male_Science_all" column contains the IAT scores. Only 37% of the scores are missing, which is interesting because 52% of sessions were not completed. Maybe completing the whole study is not needed to receive an IAT score? There are survey questions at the end of the study, so skipping that portion may be what marks a session as incomplete.

(I just realized this) Looking at the data dictionary, we can see that "D_biep.Male_Science_36" and "D_biep.Male_Science_47" represent the IAT scores for specific blocks in the study. When a participant completes a block in the study, an IAT score is generated. Only 37% of the values in "D_biep.Male_Science_36" and "D_biep.Male_Science_47" are missing, which matches the 37% of missing values in the "D_biep.Male_Science_all" column.
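
The idea that a session can be incomplete and still have an IAT score can be checked by cross-tabulating the two missingness flags. Below is a minimal sketch on a tiny made-up frame (the values are invented for illustration, not taken from the dataset). Only the two column names come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe; values are invented for illustration.
toy = pd.DataFrame({
    "session_status": ["C", np.nan, np.nan, "C", np.nan],
    "D_biep.Male_Science_all": [0.31, 0.52, np.nan, -0.10, np.nan],
})

# Share of missing values per column, computed without an explicit loop.
pct_missing = toy.isna().mean().mul(100).round()

# Rows: was the session left incomplete? Columns: is the IAT score missing?
incomplete = toy["session_status"].isna().rename("incomplete")
no_score = toy["D_biep.Male_Science_all"].isna().rename("score_missing")
table = pd.crosstab(incomplete, no_score)

print(pct_missing)
print(table)
```

On the real data, a large count in the "incomplete but score present" cell would support the explanation that the end-of-study survey, not the IAT itself, decides completion.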

In [4]:
df.describe()

|  | session_id | month | day | year | hour | weekday | user_id |
|---|---|---|---|---|---|---|---|
| count | 110724.0 | 110724.0 | 110724.0 | 110724.0 | 110723.0 | 110723.0 | 110723.0 |
| mean | 2645368000.0 | 3.535421 | 15.53273 | 2019.981937 | 13.551367 | 3.904338 | 290814.8 |
| std | 1286321.0 | 1.831973 | 8.765753 | 6.010479 | 7.060249 | 1.708363 | 1780973.0 |
| min | 2643542000.0 | 1.0 | 1.0 | 20.0 | 0.0 | 1.0 | -1.0 |
| 25% | 2644314000.0 | 2.0 | 8.0 | 2020.0 | 8.0 | 3.0 | -1.0 |
| 50% | 2645127000.0 | 3.0 | 15.0 | 2020.0 | 16.0 | 4.0 | -1.0 |
| 75% | 2646052000.0 | 5.0 | 23.0 | 2020.0 | 19.0 | 5.0 | -1.0 |
| max | 2648229000.0 | 6.0 | 31.0 | 2020.0 | 23.0 | 7.0 | 11308730.0 |


One interesting observation from this table is that the data covers only 6 months (the maximum "month" is 6). The table also hints at a possible data-entry problem: the minimum "year" is 20 rather than 2020. Besides that, this table isn't really helpful because we don't need summary statistics for these columns. The fact that summary statistics were only calculated for these columns, when many more columns contain numerical data, means that we have to convert multiple columns to numeric types during the data cleaning process.
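
One way to do that conversion for several columns at once is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN instead of raising. This is only a sketch on an invented frame, and the list of columns to convert is chosen by hand here as an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame with numbers stored as strings, as in the raw CSV (values invented).
toy = pd.DataFrame({
    "birthyear": ["1990", "2001", " ", "1985"],
    "arts": ["4", "", "5", "2"],
    "study_name": ["a", "b", "c", "d"],
})

# Blank strings become NaN, mirroring the replace step used earlier.
toy = toy.replace(r"^\s*$", np.nan, regex=True)

numeric_cols = ["birthyear", "arts"]  # hand-picked for this sketch
# errors="coerce" converts anything unparseable to NaN instead of raising.
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
print(toy.describe())
```

After the conversion, `describe()` includes the converted columns, so the summary table covers them as well.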

In [30]:
#Mainly looking at data type for each column
df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110724 entries, 0 to 110723
Data columns (total 140 columns):
 #    Column                   Non-Null Count   Dtype  
---   ------                   --------------   -----  
 0    session_id               110724 non-null  float64
 1    session_status           53580 non-null   object 
 2    study_name               110724 non-null  object 
 3    date                     110724 non-null  object 
 4    month                    110724 non-null  float64
 5    day                      110724 non-null  float64
 6    year                     110724 non-null  float64
 7    hour                     110723 non-null  float64
 8    weekday                  110723 non-null  float64
 9    birthmonth               68167 non-null   object 
 10   birthyear                65413 non-null   object 
 11   num_002                  68407 non-null   object 
 12   birthSex                 70002 non-null   object 
 13   raceomb_002              62334 non-null   

In [80]:
df['D_biep.Male_Science_all'] = df['D_biep.Male_Science_all'].astype(float)
df['D_biep.Male_Science_all'].round(4).describe()

count    69921.000000
mean         0.290230
std          0.430688
min         -1.843300
25%          0.015800
50%          0.316300
75%          0.595500
max          1.816000
Name: D_biep.Male_Science_all, dtype: float64

"D_biep.Male_Science_all" represents the IAT scores. The IAT scores range from -2 to +2, with -2 representing a strong female-science association and +2 representing a strong male-science association. All scores above +0.65 or below -0.65 indicate a “strong” association. Looking at the statistics above, we can see that the average IAT score was 0.290, which means overall people have a slight inclination to associate men with science. We can also see that the median is slightly greater than the mean, so there might be a left skew. 

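The skew and the share of "strong" scores can be measured directly rather than inferred from the mean and median. A minimal sketch, using invented scores in place of the real "D_biep.Male_Science_all" column and the ±0.65 cutoff quoted above:

```python
import pandas as pd

# Invented scores for illustration only; not taken from the dataset.
scores = pd.Series([-0.9, -0.2, 0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8])

# Sample skewness: a negative value indicates a longer left tail.
skewness = scores.skew()

# Shares of "strong" associations using the +/-0.65 cutoff.
strong_male = (scores > 0.65).mean()
strong_female = (scores < -0.65).mean()

print(f"mean={scores.mean():.3f} median={scores.median():.3f} skew={skewness:.3f}")
print(f"strong male-science: {strong_male:.1%}")
print(f"strong female-science: {strong_female:.1%}")
```

`Series.skew()` ignores NaN by default, so it can be applied to the real column without dropping missing scores first.
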
In [81]:
df['D_biep.Male_Science_all'].round(4).mode()

0    0.5426
dtype: float64

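The mode is greater than the median and mean, which is consistent with a left skew. (I would have created a box plot, but it always resulted in a blank plot, and I couldn't get around that.) The mode only identifies the single most common rounded score, so it does not tell us how the majority of participants scored. What we can say is that the mode (0.5426) and the 75th percentile (0.5955) both fall below the +0.65 cutoff, so at least three quarters of the scored participants do not show a strong male-science association.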
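
A blank box plot is a typical symptom of passing data that still contains NaN straight to matplotlib's `boxplot`, which does not skip missing values on its own. Assuming that was the cause here (it has not been confirmed for this notebook), dropping the NaN values first should give a visible plot. The sketch below uses invented scores and adds a histogram, which shows where scores cluster better than a mode does.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented scores with missing values, standing in for the IAT column.
scores = pd.Series([0.1, 0.3, np.nan, 0.5, -0.4, np.nan, 0.7, 0.2])

# Drop NaN before plotting; the NaN entries are the suspected cause of the blank plot.
clean = scores.dropna().to_numpy()

fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

box = ax_box.boxplot(clean)
ax_box.set_title("Box plot (NaN dropped)")

# The histogram shows the overall shape of the distribution.
counts, edges, _ = ax_hist.hist(clean, bins=4)
ax_hist.set_title("Histogram")

fig.tight_layout()
```

In a notebook the figure renders at the end of the cell. For the real column, replace `scores` with `df['D_biep.Male_Science_all']`.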

## Visualize

Create any visualizations for your EDA here. Make note in the form of code comments of what your thought process is for your visualizations.

Because most of the data needs to be cleaned and converted into different data types, I won't be able to provide many data visualizations at the moment. For now, I will only clean a few columns to provide some visualizations.

In [71]:
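# Convert a few columns from strings to floats so they can be correlated.
df['birthSex'] = df['birthSex'].astype(float)
df['birthyear'] = df['birthyear'].astype(float)
df['arts'] = df['arts'].astype(float)
df['science'] = df['science'].astype(float)

# Build a correlation matrix for the converted columns and the IAT score,
# then color it so stronger correlations stand out.
converted_data = df[['birthSex','birthyear','arts','science','D_biep.Male_Science_all']]
corr_data = converted_data.corr().round(5)
corr_data.style.background_gradient(cmap='coolwarm')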

|  | birthSex | birthyear | arts | science | D_biep.Male_Science_all |
|---|---|---|---|---|---|
| birthSex | 1.0 | 0.05569 | 0.12581 | -0.1033 | -0.06855 |
| birthyear | 0.05569 | 1.0 | -0.13864 | -0.13499 | -0.10207 |
| arts | 0.12581 | -0.13864 | 1.0 | 0.04286 | 0.09416 |
| science | -0.1033 | -0.13499 | 0.04286 | 1.0 | -0.11968 |
| D_biep.Male_Science_all | -0.06855 | -0.10207 | 0.09416 | -0.11968 | 1.0 |


Important Note: "Arts" represents one's attitude towards the arts. "Science" represents one's attitude towards science. Both variables are rated on a Likert scale.

I decided to use these columns for the correlation plot because I suspected that these variables might be correlated, but every correlation turned out to be weak! The largest in magnitude is about -0.14, between "birthyear" and "arts". Maybe in the future, I will find stronger correlations between other variables after I clean the data.
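
For a two-valued variable such as "birthSex", a correlation coefficient is equivalent to comparing the two group means, so a grouped summary can be easier to read. A minimal sketch on invented rows follows. The 1/2 coding of "birthSex" is an assumption made for illustration and should be checked against the data dictionary.

```python
import pandas as pd

# Invented rows; the 1/2 coding of birthSex is an assumption for this sketch.
toy = pd.DataFrame({
    "birthSex": [1, 1, 1, 2, 2, 2],
    "D_biep.Male_Science_all": [0.40, 0.10, 0.30, 0.50, 0.20, 0.35],
})

# Per-group summary of IAT scores.
by_sex = toy.groupby("birthSex")["D_biep.Male_Science_all"].agg(["count", "mean", "std"])

# Pearson correlation between the two columns, as in the matrix above.
r = toy["birthSex"].corr(toy["D_biep.Male_Science_all"])

print(by_sex.round(3))
print(f"r = {r:.3f}")
```

The group means show the size of any difference in score units, which the correlation coefficient alone does not.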

## Summarize Your Results

With your EDA complete, answer the following questions.

1. Was there anything surprising about your dataset? 
2. Do you have any concerns about your dataset? 
3. Is there anything you want to make note of for the next phase of your analysis, which is cleaning data? 

1. I was surprised that the variables in the correlation plot were only weakly correlated. I thought we might have seen a clearer gender difference in IAT scores.
2. I am slightly concerned by how much data I have to clean, but I think it is possible. 
3. Most of the columns holding important numerical values are coded as strings, so I am going to have to convert them. Also, a good portion of the columns came from the end-of-test survey, and I don't need them (e.g., fear of covid data, perceived vulnerability to disease data, intolerance to uncertainty data).
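
Dropping the unneeded survey columns could be done by matching on column-name prefixes. In the sketch below the frame is invented and the "covid_" and "pvd_" prefixes are hypothetical placeholders. The real prefixes have to be taken from the data dictionary.

```python
import pandas as pd

# Invented columns; "covid_" and "pvd_" are hypothetical placeholder prefixes,
# not the real names from the data dictionary.
toy = pd.DataFrame({
    "session_id": [1, 2],
    "D_biep.Male_Science_all": [0.3, 0.1],
    "covid_001": [1, 2],
    "covid_002": [3, 4],
    "pvd_001": [5, 6],
})

unneeded_prefixes = ("covid_", "pvd_")
to_drop = [col for col in toy.columns if col.startswith(unneeded_prefixes)]

# drop() returns a new frame, so the original stays available for reference.
trimmed = toy.drop(columns=to_drop)
print(trimmed.columns.tolist())
```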