**My First Data Breakdown on Kaggle**

I'm new to the data scene and pursuing a Master's in Data Science because I find the field fascinating and powerful. While learning Python for data science, I came across Kaggle and found this dataset on diversity, which hits close to home as I worked a lot with Inclusive Excellence and Diversity in my undergrad work. I was really interested in how the breakdown looks for these top tech companies. Note that this data is from 2016 only; it would be nice to have data from other years to see whether any one demographic has shifted.

I decided to use Plotly as my visualization library for Python because its charts are interactive and the viewer can choose what they would like to see. Use the legend to show or hide any part of the data.

This is still a work in progress and I want to add more analysis on male vs. female, by job category, and more!

In [12]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Interactive Data Visualization with plotly
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

Here's a quick look at the DataFrame. There are six columns in this dataset:

**company**: Company name

**year**: For now, 2016 only

**race**: Possible values: "American_Indian_Alaskan_Native", "Asian", "Black_or_African_American", "Latino", "Native_Hawaiian_or_Pacific_Islander", "Two_or_more_races", "White", "Overall_totals"

**gender**: Possible values: "male", "female". Non-binary gender is not counted in EEO-1 reports.

**job_category**: Possible values: "Administrative support", "Craft workers", "Executive/Senior officials & Mgrs", "First/Mid officials & Mgrs", "laborers and helpers", "operatives", "Professionals", "Sales workers", "Service workers", "Technicians", "Previous_totals", "Totals"

**count**: Mostly integer values, but contains "na" where no data is available.

In [13]:
df = pd.read_csv('../input/Reveal_EEO1_for_2016.csv')
df.head()

Outputting the general info for all the columns, we see that the "gender" column has 264 fewer non-null entries than the other columns (3696 vs. 3960), so let's drop these rows as we don't have data for them anyway.

In [14]:
df.info()
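
Another way to see where the missing values are is to count them per column. Here's a minimal sketch on a small made-up frame (the `toy` data and its values are my own assumptions for illustration, not rows from the real CSV):

```python
import pandas as pd

# Toy frame standing in for the real CSV (values are made up for illustration)
toy = pd.DataFrame({
    "company": ["A", "A", "B"],
    "gender": ["male", None, "female"],
    "count": ["10", "25", "na"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# The rows where gender is missing
print(toy[toy["gender"].isna()])
```

Note that the string "na" in the count column is not counted as missing here, since it is an ordinary string rather than a null value. That is why it needs its own cleaning step.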

**Cleaning the Data**

We'll drop any rows with null values and any rows with 'na' in the count column because these won't be of any use to us. I also personally like using "Ethnicity" rather than "Race" so I renamed that column as well.

In [15]:
# Drop rows with null values and rows with 'na' in the count column
# (.copy() avoids a SettingWithCopyWarning on the assignment below)
df = df.dropna()
df = df[df['count']!='na'].copy()

# Change count column to int32
df['count'] = df['count'].astype(dtype='int32')

# I like using "Ethnicity" rather than "Race" in this context, so I'm going to rename that column
df.rename(columns={'race':'ethnicity'}, inplace=True)
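
The same cleaning idea can also be sketched with `pd.to_numeric`, which turns anything that isn't a number (like "na") into a null so that everything can be dropped in one go. This is just an alternative I'm showing on made-up data; the `toy` frame and the name `cleaned` are assumptions for illustration:

```python
import pandas as pd

# Toy frame with the same kinds of problems as the real data (values are made up)
toy = pd.DataFrame({
    "company": ["A", "A", "B", "B"],
    "race": ["Asian", "White", "Asian", "White"],
    "gender": ["male", "female", None, "male"],
    "count": ["10", "na", "7", "3"],
})

cleaned = toy.copy()

# Non-numeric strings such as "na" become NaN instead of raising an error
cleaned["count"] = pd.to_numeric(cleaned["count"], errors="coerce")

# Drop rows that are missing either gender or count
cleaned = cleaned.dropna(subset=["gender", "count"]).copy()

# With the NaN rows gone, the column can be stored as integers
cleaned["count"] = cleaned["count"].astype("int32")

# Same rename as in the notebook
cleaned = cleaned.rename(columns={"race": "ethnicity"})
print(cleaned)
```

One nice thing about this version is that it would also catch other non-numeric placeholders, not just the exact string "na".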

**Total Employees for Each Company**

Before breaking down each company further, let's get an overall picture of the data by looking at the total number of employees at each company.

In [16]:
all_sums = []

# Get totals for each company through "Totals" row in "job_category" column and summing them up
for comp in df.company.unique():
    s = df[(df.company==comp) & (df.job_category=='Totals')]['count'].sum()
    all_sums.append(s)

# Create new DataFrame with totals for each company
sums = pd.DataFrame(columns=['Company', 'Total Employees'])
sums.Company = df.company.unique()
sums['Total Employees'] = all_sums

# Setup Data for Plotly chart
data = [
    go.Bar(
        x = sums['Total Employees'],
        y = sums.Company,
        orientation = 'h'
    )
]

# Setup Layout for Plotly chart
layout = go.Layout(
    title = 'Total Employees by Company',
    xaxis = {
        'title': '# of Employees'
    },
    yaxis = {
        'autorange': 'reversed',
        'title': 'Company'
    })

# Show Plotly Figure
fig = go.Figure(data=data, layout=layout)
iplot(fig)

As we can see, Apple, Intel, HPE, Google, and Cisco have significantly more employees than the other companies, with 23andMe being the smallest.
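
The per-company totals can also be computed without a loop by using `groupby`. Here's a minimal sketch on made-up data (the `toy` frame and the name `totals` are assumptions for illustration):

```python
import pandas as pd

# Toy frame: only the "Totals" rows should count toward a company's headcount
toy = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B"],
    "job_category": ["Totals", "Totals", "Professionals", "Totals", "Totals"],
    "count": [10, 5, 8, 7, 3],
})

totals = (
    toy[toy["job_category"] == "Totals"]   # keep the "Totals" rows only
    .groupby("company", sort=False)["count"]
    .sum()                                  # one sum per company
    .reset_index()
    .rename(columns={"company": "Company", "count": "Total Employees"})
)
print(totals)
```

Passing `sort=False` keeps the companies in the order they first appear, which matches what `df.company.unique()` gives.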

**In-Depth Look at Diversity for Each Company**

First, we'll group the data by company and ethnicity, get rid of the year column since all the data is for 2016 anyway, and sum up the totals for each ethnicity. Then we'll calculate each ethnicity's total as a percentage of its company's total.

In [20]:
# Group by Company and Ethnicity, grab only the 'Totals' job_category rows and sum up the count for each ethnicity group into a new DataFrame
ethnicity = df[df.job_category=='Totals'].groupby(['company','ethnicity'], sort=False)['count'].sum().reset_index()

# Look up each company's total by name, then calculate every ethnicity group's share of it
company_totals = sums.set_index('Company')['Total Employees']
ethnicity['percentage'] = ethnicity['count'] / ethnicity['company'].map(company_totals) * 100
ethnicity.head()
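
The percentage step can also be sketched with `groupby(...).transform('sum')`, which doesn't need a separate totals table. This is a minimal example on made-up numbers (the `toy` frame and `company_total` are assumptions for illustration):

```python
import pandas as pd

# Toy counts per company and ethnicity (values are made up)
toy = pd.DataFrame({
    "company": ["A", "A", "B", "B"],
    "ethnicity": ["Asian", "White", "Asian", "White"],
    "count": [25, 75, 10, 30],
})

# transform("sum") returns the company total repeated on every row of that company
company_total = toy.groupby("company")["count"].transform("sum")

# Each row's share of its own company's total
toy["percentage"] = toy["count"] / company_total * 100
print(toy)
```

Because the total is aligned row by row, the division stays correct regardless of the order of the rows.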

In [18]:
# Setup data for Plotly chart for each company into a list
trace = []
list_of_ethnicities = ethnicity.ethnicity.unique()

for comp in sums.Company:
    comp_rows = ethnicity[ethnicity.company==comp]
    trace.append(go.Bar(
                    # Take labels and values from the same rows so they always line up
                    x = comp_rows['ethnicity'],
                    y = comp_rows['percentage'],
                    name = comp
    ))

# Setup layout for Plotly Chart
layout = go.Layout(barmode = 'group',
                   title = 'Diversity Proportion Breakdown by Company',
                   xaxis = {
                       'title': 'Ethnicities'
                   },
                   yaxis = {
                       'title': 'Percentage %'
                   },
                   legend = {
                       'xanchor': 'auto'
                   })

# Show Plotly Figure
fig = go.Figure(data=trace, layout=layout)
iplot(fig)
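
One thing to watch with a grouped chart like this is that every company needs a value for every ethnicity, or the bars and labels can drift apart. Here's a minimal sketch of reshaping the data so each company becomes a column with gaps filled by 0 (the `toy` frame and the name `wide` are assumptions for illustration, and treating a missing group as 0% is my own choice here):

```python
import pandas as pd

# Toy percentages where company B has no "Asian" row (values are made up)
toy = pd.DataFrame({
    "company": ["A", "A", "B"],
    "ethnicity": ["Asian", "White", "White"],
    "percentage": [40.0, 60.0, 100.0],
})

# One row per ethnicity, one column per company; missing combinations become 0
wide = toy.pivot(index="ethnicity", columns="company", values="percentage").fillna(0)
print(wide)
```

With this shape, `wide.index` could serve as the x values and each column as the y values for one company's bars.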

In [19]:
for comp in sums.Company:

    # Setup data for each Plotly chart for each company
    # (labels and values come from the same rows so they always line up)
    comp_rows = ethnicity[ethnicity.company==comp]
    data = [
        go.Pie(
            values = comp_rows['percentage'],
            labels = comp_rows['ethnicity'],
            name = comp,
            hoverinfo = "label"
        )
    ]

    # Setup layout for Plotly Charts for each company
    layout = go.Layout(
        title = 'Diversity Breakdown for ' + comp
    )

    # Show Plotly Figure
    fig = go.Figure(data=data, layout=layout)
    iplot(fig)

**Gender Breakdown for Each Company**

In [21]:
gender = df[df.job_category=='Totals'].groupby(['company','gender'], sort=False)['count'].sum().reset_index()

# Look up each company's total by name, then calculate each gender group's share of it
company_totals = sums.set_index('Company')['Total Employees']
gender['percentage'] = gender['count'] / gender['company'].map(company_totals) * 100

list_of_companies = sums.Company.unique()
data = [go.Bar(
              x = list_of_companies,
              y = gender[gender.gender=='male']['percentage'],
              marker = {
                  'color': 'rgb(55, 83, 109)'
              },
              name = 'Male'
        ),
        go.Bar(
              x = list_of_companies,
              y = gender[gender.gender=='female']['percentage'],
              marker = {
                  'color': 'rgb(26, 118, 255)'
              },
              name = 'Female'
        )]

layout = go.Layout(barmode = 'group',
                   title = 'Gender Percentage by Company',
                   xaxis = {
                       'title': 'Company'
                   },
                   yaxis = {
                       'title': 'Percentage %',
                       'hoverformat': '.2f'
                   },
                   legend = {
                       'xanchor': 'auto'
                   })

fig = go.Figure(data=data, layout=layout)
iplot(fig)
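
A quick sanity check on percentages like these is that the male and female shares for each company should add up to 100. Here's a minimal sketch on made-up numbers (the `toy` frame and the names `share_sum` and `all_sum_to_100` are assumptions for illustration):

```python
import pandas as pd

# Toy gender percentages per company (values are made up)
toy = pd.DataFrame({
    "company": ["A", "A", "B", "B"],
    "gender": ["male", "female", "male", "female"],
    "percentage": [70.0, 30.0, 62.5, 37.5],
})

# Sum of the shares within each company
share_sum = toy.groupby("company")["percentage"].sum()

# Allow a tiny tolerance for floating-point rounding
all_sum_to_100 = bool(((share_sum - 100).abs() < 1e-9).all())
print(share_sum)
print(all_sum_to_100)
```

If a company does not come out at 100, that usually points to rows that were dropped or counted twice earlier on.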